METHODS, SOFTWARE ARRANGEMENTS, STORAGE MEDIA, AND
SYSTEMS FOR PROVIDING A SHRINKAGE-BASED SIMILARITY METRIC



CROSS REFERENCE TO RELATED APPLICATION 

This application claims priority from U.S. Patent Application Serial 
No. 60/464,983 filed on April 24, 2003, the entire disclosure of which is incorporated 
herein by reference. 



FIELD OF THE INVENTION

The present invention relates generally to systems, methods, and
software arrangements for determining associations between one or more elements
contained within two or more datasets. For example, the embodiments of systems,
methods, and software arrangements determining such associations may obtain a
correlation coefficient that incorporates both prior assumptions regarding two or more
datasets and actual information regarding such datasets.


BACKGROUND OF THE INVENTION 

Recent improvements in observational and experimental techniques
allow those of ordinary skill in the art to better understand the structure of a
substantially unobservable transparent cell. For example, microarray-based gene
expression analysis may allow those of ordinary skill in the art to quantify the
transcriptional states of cells. Partitioning or clustering genes into closely related
groups has become an important mathematical process in the statistical analyses of
microarray data.

Traditionally, algorithms for cluster analysis of genome-wide
expression data from DNA microarray hybridization were based upon statistical
properties of gene expression, and resulted in organizing genes according to similarity
in pattern of gene expression. These algorithms display the output graphically, often
in a binary tree form, conveying the clustering and the underlying expression data
simultaneously. If two genes belong to the same cluster (or, equivalently, if they
belong to the same subtree of small depth), then it may be possible to infer a common
regulatory mechanism for the two genes, or to interpret this information as an
indication of the status of cellular processes. Furthermore, coexpression of genes of
known function with novel genes may result in a discovery process for characterizing
unknown or poorly characterized genes. In general, false negatives (where two
coexpressed genes are assigned to distinct clusters) may cause the discovery process
to ignore useful information for certain novel genes, and false positives (where two
independent genes are assigned to the same cluster) may result in noise in the
information provided to the subsequent algorithms used in analyzing regulatory
patterns. Consequently, it may be important that the statistical algorithms for
clustering are reasonably robust. Nevertheless, the microarray experiments that can
be carried out in an academic laboratory at a reasonable cost are few in number, and suffer
from experimental noise. As such, those of ordinary skill in the art may use
certain algorithms to deal with small sample data.


One conventional clustering algorithm is described in Eisen et al.
("Eisen"), Proc. Natl. Acad. Sci. USA 95, 14863-14868 (1998). In Eisen, the
gene-expression data were collected on spotted DNA microarrays (See, e.g., Schena et al.
("Schena"), Proc. Natl. Acad. Sci. USA 93, 10614-10619 (1996)), and were based
upon gene expression in the budding yeast Saccharomyces cerevisiae during the
diauxic shift (See, e.g., DeRisi et al. ("DeRisi"), Science 278, 680-686 (1997)), the
mitotic cell division cycle (See, e.g., Spellman et al. ("Spellman"), Mol. Biol. Cell 9,
3273-3297 (1998)), sporulation (See, e.g., Chu et al. ("Chu"), Science 282, 699-705
(1998)), and temperature and reducing shocks. The disclosures of each of these
references are incorporated herein by reference in their entireties. In Eisen, RNA
from experimental samples (taken at selected times during the process) was labeled
during reverse transcription with a red-fluorescent dye Cy5, and mixed with a
reference sample labeled in parallel with a green-fluorescent dye Cy3. After
hybridization and appropriate washing steps, separate images were acquired for each
fluorophore, and fluorescence intensity ratios were obtained for all target elements. The
experimental data were provided in an M×N matrix structure, in which the M rows
represented all genes for which data had been collected, the N columns represented
individual array experiments (e.g., single time points or conditions), and each entry
represented the measured Cy5/Cy3 fluorescence ratio at the corresponding target
element on the appropriate array. All ratio values were log-transformed to treat
inductions and repressions of identical magnitude as numerically equal but opposite in
sign. In Eisen, it was assumed that the raw ratio values followed log-normal
distributions and hence, the log-transformed data followed normal distributions.

The gene similarity metric employed in this publication was a form of
a correlation coefficient. Let $G_i$ be the (log-transformed) primary data for a gene $G$ in
condition $i$. For any two genes $X$ and $Y$ observed over a series of $N$ conditions, the
classical similarity score based upon a Pearson correlation coefficient is:

$$S(X,Y) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{X_i - X_{\text{offset}}}{\Phi_X}\right)\left(\frac{Y_i - Y_{\text{offset}}}{\Phi_Y}\right)$$

where

$$\Phi_G = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(G_i - G_{\text{offset}}\right)^2}$$

and $G_{\text{offset}}$ is the estimated mean of the observations, i.e.,

$$G_{\text{offset}} = \bar{G} = \frac{1}{N}\sum_{i=1}^{N} G_i.$$

$\Phi_G$ is the (rescaled) estimated standard deviation of the observations. In the Pearson
correlation coefficient model, $G_{\text{offset}}$ is set equal to $\bar{G}$. Nevertheless, in the analysis
described in Eisen, "values of $G_{\text{offset}}$ which are not the average over observations on $G$
were used when there was an assumed unchanged or reference state represented by
the value of $G_{\text{offset}}$, against which changes were to be analyzed; in all of the examples
presented there, $G_{\text{offset}}$ was set to 0, corresponding to a fluorescence ratio of 1.0." To
distinguish this modified correlation coefficient from the classical Pearson correlation
coefficient, we shall refer to it as the Eisen correlation coefficient. Nevertheless, setting
$G_{\text{offset}}$ equal to 0 or to $\bar{G}$ results in an increase in false positives or false negatives,
respectively.
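
The difference between the two offset choices can be illustrated with a short Python sketch. The function names (`offset_similarity`, `pearson`, `eisen`) and the sample vectors are illustrative assumptions, not part of the disclosure; the sketch only evaluates the similarity score S(X, Y) for a given pair of offsets.

```python
import math

def offset_similarity(x, y, x_offset, y_offset):
    """Similarity score S(X, Y) computed about explicit offsets."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    # Phi_G: rescaled standard deviation of G measured about its offset.
    phi_x = math.sqrt(sum((xi - x_offset) ** 2 for xi in x) / n)
    phi_y = math.sqrt(sum((yi - y_offset) ** 2 for yi in y) / n)
    if phi_x == 0.0 or phi_y == 0.0:
        raise ValueError("score undefined when a vector equals its offset everywhere")
    cross = sum((xi - x_offset) * (yi - y_offset) for xi, yi in zip(x, y))
    return cross / (n * phi_x * phi_y)

def pearson(x, y):
    # Offsets are the sample means of the two vectors.
    return offset_similarity(x, y, sum(x) / len(x), sum(y) / len(y))

def eisen(x, y):
    # Offsets are fixed at the reference state 0.
    return offset_similarity(x, y, 0.0, 0.0)

# Hypothetical data: two genes that move in opposite directions
# around a shared non-zero level.
x = [2.0, 3.0, 2.0, 3.0]
y = [3.0, 2.0, 3.0, 2.0]
print(pearson(x, y), eisen(x, y))
```

For these hypothetical vectors the mean-centered score is negative while the zero-offset score is strongly positive, which is the kind of disagreement between the two coefficients that the passage above describes.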

SUMMARY OF THE INVENTION

The present invention relates generally to systems, methods, and
software arrangements for determining associations between one or more elements
contained within two or more datasets. An exemplary embodiment of the systems,
methods, and software arrangements determining the associations may obtain a
correlation coefficient that incorporates both prior assumptions regarding two or more
datasets and actual information regarding such datasets. For example, an exemplary
embodiment of the present invention is directed toward systems, methods, and
software arrangements in which one of the prior assumptions used to calculate the
correlation coefficient is that an expression vector mean $\mu$ of each of the two or more
datasets is a zero-mean normal random variable (with an a priori distribution
$N(0, \tau^2)$), and in which one of the actual pieces of information is an a posteriori
distribution of the expression vector mean $\mu$ that can be obtained directly from the data
contained in the two or more datasets. The exemplary embodiment of the systems,
methods, and software arrangements of the present invention is more beneficial in
comparison to conventional methods in that it likely produces fewer false negative
and/or false positive results. The exemplary embodiment of the systems, methods,
and software arrangements of the present invention is further useful in the analysis
of microarray data (including gene expression arrays) to determine correlations
between genotypes and phenotypes. Thus, the exemplary embodiments of the
systems, methods, and software arrangements of the present invention are useful in
elucidating the genetic basis of complex genetic disorders (e.g., those characterized by
the involvement of more than one gene).

According to the exemplary embodiment of the present invention, a
similarity metric for determining an association between two or more datasets may
take the form of a correlation coefficient. However, unlike conventional correlations,
the correlation coefficient according to the exemplary embodiment of the present
invention may be derived from both prior assumptions regarding the datasets
(including but not limited to the assumption that each dataset has a zero mean), and
actual information regarding the datasets (including but not limited to an a posteriori
distribution of the mean). Thus, in one exemplary embodiment of the present
invention, a correlation coefficient may be provided, the mathematical derivation of
which can be based on James-Stein shrinkage estimators. In this manner, it can be
ascertained how a shrinkage parameter of this correlation coefficient may be
optimized from a Bayesian point of view, e.g., by moving from a value obtained from
a given dataset toward a "believed" or theoretical value. For example, in one
exemplary embodiment of the present invention, $G_{\text{offset}}$ of the gene similarity metric
described above may be set equal to $\gamma\bar{G}$, where $\gamma$ is a value between 0.0 and 1.0.
When $\gamma = 1.0$, the resulting similarity metric may be the same as the Pearson
correlation coefficient, and when $\gamma = 0.0$, it may be the same as the Eisen correlation
coefficient. However, for a non-integer value of $\gamma$ (i.e., a value other than 0.0 or 1.0),
the estimator $G_{\text{offset}} = \gamma\bar{G}$ can be considered as the unbiased estimator $\bar{G}$
shrinking toward the believed value for $G_{\text{offset}}$. This optimization of the correlation
coefficient can minimize the occurrence of false positives relative to the Eisen
correlation coefficient, and the occurrence of false negatives relative to the Pearson
correlation coefficient.

According to an exemplary embodiment of the present invention, the
general form of the following equation:

$$S(X,Y) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{X_i - X_{\text{offset}}}{\Phi_X}\right)\left(\frac{Y_i - Y_{\text{offset}}}{\Phi_Y}\right) \tag{1}$$

where

$$\Phi_G^2 = \frac{1}{N}\sum_{i=1}^{N}\left(G_i - G_{\text{offset}}\right)^2 \quad\text{and}\quad G_{\text{offset}} = \gamma\bar{G} \quad\text{for } G \in \{X, Y\} \tag{2}$$

can be used to derive a similarity metric which is dictated by the data. In a general
setting, all values $X_{ij}$ for gene $j$ may have a Normal distribution with mean $\theta_j$ and
standard deviation $\beta_j$ (variance $\beta_j^2$); i.e., $X_{ij} \sim N(\theta_j, \beta_j^2)$ for $i = 1, \ldots, N$ with $j$ fixed
($1 \le j \le M$), where $\theta_j$ is an unknown parameter (taking different values for different $j$). For
the purpose of estimation, $\theta_j$ can be assumed to be a random variable taking values
close to zero: $\theta_j \sim N(0, \tau^2)$.

In one exemplary embodiment of the present invention, the posterior
distribution of $\theta_j$ may be derived from the prior $N(0, \tau^2)$ and the data via the application
of James-Stein Shrinkage estimators. $\theta_j$ then may be estimated by its mean. In another
exemplary embodiment, the James-Stein Shrinkage estimators are $W$ and $\hat{\beta}^2$.

In yet another exemplary embodiment of the present invention, the
posterior distribution of $\theta_j$ may be derived from the prior $N(0, \tau^2)$ and the data from the
Bayesian considerations. $\theta_j$ then may be estimated by its mean.

The present invention further provides exemplary embodiments of the
systems, methods, and software arrangements for implementation of hierarchical
clustering of two or more datapoints in a dataset. In one preferred embodiment of the
present invention, the datapoints to be clustered can be gene expression levels
obtained from one or more experiments, in which gene expression levels may be
analyzed under two or more conditions. Such data documenting alterations in the
gene expression under various conditions may be obtained by microarray-based
genomic analysis or other high-throughput methods known to those of ordinary skill
in the art. Such data may reflect the changes in gene expression that occur in
response to alterations in various phenotypic indicia, which may include but are not
limited to developmental and/or pathophysiological (i.e., disease-related) changes.
Thus, in one exemplary embodiment of the present invention, the establishment of
genotype/phenotype correlations may be permitted. The exemplary systems,
methods, and software arrangements of the present invention may also obtain
genotype/phenotype correlations in complex genetic disorders, i.e., those in which
more than one gene may play a significant role. Such disorders include, but are not
limited to, cancer, neurological diseases, developmental disorders,
neurodevelopmental disorders, cardiovascular diseases, metabolic diseases,
immunologic disorders, infectious diseases, and endocrine disorders.

According to still another exemplary embodiment of the present
invention, a hierarchical clustering pseudocode may be used in which a clustering
procedure is utilized by selecting the most similar pair of elements, starting with
genes at the bottom-most level, and combining them to create a new element. In one
exemplary embodiment of the present invention, the "expression vector" for the new
element can be the weighted average of the expression vectors of the two
most similar elements that were combined. In another embodiment of the present
invention, the structure of repeated pair-wise combinations may be represented in a
binary tree, whose leaves can be the set of genes, and whose internal nodes can be the
elements constructed from the two children nodes.

In another preferred embodiment of the present invention, the
datapoints to be clustered may be values of stocks from one or more stock markets
obtained at one or more time periods. Thus, in this preferred embodiment, the
identification of stocks or groups of stocks that behave in a coordinated fashion
relative to other groups of stocks or to the market as a whole can be ascertained. The
exemplary embodiment of the systems, methods, and software arrangements of the
present invention therefore may be used for financial investment and related activities.

For a better understanding of the present invention, together with other
and further objects, reference is made to the following description, taken in
conjunction with the accompanying drawings, and its scope will be pointed out in the
appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention and its
advantages, reference is now made to the following description, taken in conjunction
with the accompanying drawings, in which:
Figure 1 is a first exemplary embodiment of a system according to the
present invention for determining an association between two datasets based on a
combination of data regarding one or more prior assumptions about the datasets and
actual information derived from such datasets;

Figure 2 is a second exemplary embodiment of the system according to
the present invention for determining the association between the datasets;

Figure 3 is an exemplary embodiment of a process according to the
present invention for determining the association between two datasets which can
utilize the exemplary systems of Figures 1 and 2;

Figure 4 is an exemplary illustration of histograms generated by
performing in silico experiments with the four different algorithms, under four
different conditions;

Figure 5 is a schematic diagram illustrating the regulation of cell-cycle
functions of yeast by various transcriptional activators (Simon et al., Cell 106: 697-708
(2001)), used as a reference to test the performance of the present invention;

Figure 6 depicts Receiver Operator Characteristic (ROC) curves for
each of the three algorithms Pearson, Eisen or Shrinkage, in which each curve is
parameterized by the cut-off value $\theta \in \{1.0, 0.95, \ldots, -1.0\}$;

Figures 7A-B show FN (Panel A) and FP (Panel B) curves, each
plotted as a function of $\theta$; and

Figure 8 shows ROC curves, with threshold plotted on the z-axis.

DETAILED DESCRIPTION OF THE INVENTION

An exemplary embodiment of the present invention provides systems,
methods, and software arrangements for determining one or more associations
between one or more elements contained within two or more datasets. The
determination of such associations may be useful, inter alia, in ascertaining
coordinated changes in gene expression that may occur, for example, in response to
alterations in various phenotypic indicia, which may include (but are not limited to)
developmental and/or pathophysiological (i.e., disease-related) changes. Establishment
of these genotype/phenotype correlations can permit a better understanding of a direct
or indirect role that the identified genes may play in the development of these
phenotypes. The exemplary systems, methods, and software arrangements of the
present invention can further be useful in elucidating genotype/phenotype correlations
in complex genetic disorders, i.e., those in which more than one gene may play a
significant role. The knowledge concerning these relationships may also assist in
facilitating the diagnosis, treatment and prognosis of individuals bearing a given
phenotype. The exemplary systems, methods, and software arrangements of the
present invention also may be useful for financial planning and investment.

Figure 1 illustrates a first exemplary embodiment of a system for
determining one or more associations between one or more elements contained within
two or more datasets. In this exemplary embodiment, the system includes a
processing device 10 which is connected to a communications network 100 (e.g., the
Internet) so that it can receive data regarding prior assumptions about the datasets
and/or actual information determined from the datasets. The processing device 10 can
be a mini-computer (e.g., Hewlett Packard mini computer), a personal computer (e.g.,
a Pentium chip-based computer), a mainframe computer (e.g., IBM 3090 system), and
the like. The data can be provided from a number of sources. For example, this data
can be prior assumption data 110 obtained from theoretical considerations or actual
data 120 derived from the dataset. After the processing device 10 receives the prior
assumption data 110 and the actual information 120 derived from the dataset via the
communications network 100, it can then generate one or more results 20 which can
include an association between one or more elements contained in one or more
datasets.

Figure 2 illustrates a second exemplary embodiment of the system 10
according to the present invention in which the prior assumption data 110 obtained
from theoretical considerations or actual data 120 derived from the dataset is
transmitted to the system 10 directly from an external source, e.g., without the use of
the communications network 100 for such transfer of the data. In this second
exemplary embodiment of the system 10, it is also possible for the prior assumption
data 110 obtained from theoretical considerations or the actual information 120
derived from the dataset to be obtained from a storage device provided in or
connected to the processing device 10. Such storage device can be a hard drive, a
CD-ROM, etc., which are known to those having ordinary skill in the art.

Figure 3 shows an exemplary flow chart of the embodiment of the
process according to the present invention for determining an association between two
datasets based on a combination of data regarding one or more prior assumptions
about and actual information derived from the datasets. This process can be
performed by the exemplary processing device 10 which is shown in Figures 1 or 2.
As shown in Figure 3, the processing device 10 receives the prior assumption data
110 (first data) obtained from theoretical considerations in step 310. In step 320, the
processing device 10 receives actual information 120 derived from the dataset (second
data). In step 330, the prior assumption (first) data 110 obtained from theoretical
considerations and the actual (second) data 120 derived from the dataset are combined
to determine an association between two or more datasets. The results of the
association determination are generated in step 340.

I. OVERALL PROCESS DESCRIPTION

The exemplary systems, methods, and software arrangements
according to the present invention may be used (e.g., as shown in Figures 1-3) to
determine the associations between two or more elements contained in datasets to
obtain a correlation coefficient that incorporates both prior assumptions regarding the
two or more datasets and actual information regarding such datasets. One exemplary
embodiment of the present invention provides a correlation coefficient that can be
obtained based on James-Stein Shrinkage estimators, and teaches how a shrinkage
parameter of this correlation coefficient may be optimized from a Bayesian point of
view, moving from a value obtained from a given dataset toward a "believed" or
theoretical value. Thus, in one exemplary embodiment of the present invention, $G_{\text{offset}}$
may be set equal to $\gamma\bar{G}$, where $\gamma$ is a value between 0.0 and 1.0. When $\gamma = 1.0$, the
resulting similarity metric $S$ may be the same as the Pearson correlation coefficient,
and when $\gamma = 0.0$, $S$ may be the same as the Eisen correlation coefficient. For a
non-integer value of $\gamma$ (i.e., a value other than 0.0 or 1.0), the estimator $G_{\text{offset}} = \gamma\bar{G}$ can
be considered as an unbiased estimator $\bar{G}$ shrinking toward the believed value for
$G_{\text{offset}}$. Such exemplary optimization of the correlation coefficient may minimize the
occurrence of false positives relative to the Eisen correlation coefficient and minimize
the occurrence of false negatives relative to the Pearson correlation coefficient.



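II. EXEMPLARY MODEL

A family of correlation coefficients parameterized by $0 \le \gamma \le 1$ may be
defined as follows:

$$S(X,Y) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{X_i - X_{\text{offset}}}{\Phi_X}\right)\left(\frac{Y_i - Y_{\text{offset}}}{\Phi_Y}\right) \tag{1}$$

where

$$\Phi_G^2 = \frac{1}{N}\sum_{i=1}^{N}\left(G_i - G_{\text{offset}}\right)^2 \quad\text{and}\quad G_{\text{offset}} = \gamma\bar{G} \quad\text{for } G \in \{X, Y\}. \tag{2}$$

In contrast, the Pearson Correlation Coefficient uses $G_{\text{offset}} = \bar{G} = \frac{1}{N}\sum_{i=1}^{N} G_i$ for every
gene $G$, or $\gamma = 1$, and the Eisen Correlation Coefficient uses $G_{\text{offset}} = 0$ for every gene
$G$, or $\gamma = 0$.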

In an exemplary embodiment of the present invention, the general form
of equation (1) may be used to derive a similarity metric which is dictated by both the
data and prior assumptions regarding the data, and that reduces the occurrence of false
positives (relative to the Eisen metric) and false negatives (relative to the Pearson
correlation coefficient).



SETUP

As described above, the metric used by Eisen had the form of equation
(1) with $G_{\text{offset}}$ set to 0 for every gene $G$ (as a reference state against which to measure
the data). Nevertheless, even if it is initially assumed that each gene $G$ has zero mean,
such assumption should be updated when data becomes available. In an exemplary
embodiment of the present invention, gene expression data may be provided in the
form of the levels of M genes expressed under N experimental conditions. The data
can be viewed as

$$\left\{\{X_{ij}\}_{i=1}^{N}\right\}_{j=1}^{M}$$

where $M \gg N$ and $\{X_{ij}\}_{i=1}^{N}$ is the data vector for gene $j$.

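DERIVATION

S may be rewritten in the following notation:

$$S(X_{\cdot j}, X_{\cdot k}) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{X_{ij} - (X_j)_{\text{offset}}}{\Phi_j}\right)\left(\frac{X_{ik} - (X_k)_{\text{offset}}}{\Phi_k}\right), \qquad \Phi_j^2 = \frac{1}{N}\sum_{i=1}^{N}\left(X_{ij} - (X_j)_{\text{offset}}\right)^2 \tag{3}$$

In a general setting, the following exemplary assumptions may be made regarding the
data distribution: let all values $X_{ij}$ for gene $j$ have a Normal distribution with mean $\theta_j$
and standard deviation $\beta_j$ (variance $\beta_j^2$); i.e., $X_{ij} \sim N(\theta_j, \beta_j^2)$ for $i = 1, \ldots, N$, with $j$ fixed
($1 \le j \le M$), where $\theta_j$ is an unknown parameter (taking different values for different $j$).
For the purpose of estimation, $\theta_j$ can be assumed to be a random variable taking
values close to zero: $\theta_j \sim N(0, \tau^2)$.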
It is also possible according to the present invention to assume that the
data are range-normalized, such that $\beta_j^2 = \beta^2$ for every $j$. If this exemplary assumption
does not hold true on a given data set, it can be corrected by scaling each gene vector
appropriately. Using conventional methods, the range may be adjusted to scale to an
interval of unit length, i.e., its maximum and minimum values differ by 1. Thus, $X_{ij} \sim
N(\theta_j, \beta^2)$ and $\theta_j \sim N(0, \tau^2)$.
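
As a minimal sketch of the range adjustment described above, the following Python fragment rescales one gene vector so that its maximum and minimum differ by 1. The function name `range_normalize` and the sample values are hypothetical, and plain division by the range is assumed to be an acceptable reading of "conventional methods"; other scalings may equally be intended.

```python
def range_normalize(vector):
    """Scale a gene vector so that max(vector) - min(vector) equals 1."""
    span = max(vector) - min(vector)
    if span == 0:
        raise ValueError("a constant vector has zero range and cannot be rescaled")
    # Pure rescaling: the vector is not shifted, only divided by its range.
    return [v / span for v in vector]

gene = [0.5, 2.5, -1.5, 1.0]          # hypothetical log-ratio values
scaled = range_normalize(gene)
print(scaled, max(scaled) - min(scaled))
```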
Replacing $(X_j)_{\text{offset}}$ in equation (3) by the exact value of the mean $\theta_j$
may yield a Clairvoyant correlation coefficient of $X_{\cdot j}$ and $X_{\cdot k}$. Nevertheless, because $\theta_j$
is a random variable, it should be estimated from the data. Therefore, to obtain an
explicit formula for $S(X_{\cdot j}, X_{\cdot k})$, it is possible to derive estimators $\hat{\theta}_j$ for all $j$.

In the Pearson correlation coefficient, $\theta_j$ may be estimated by the vector
mean $\bar{X}_{\cdot j}$; and the Eisen correlation coefficient corresponds to replacing $\theta_j$ by 0 for
every $j$, which is equivalent to assuming $\theta_j \sim N(0, 0)$ (i.e., $\tau^2 = 0$). In an exemplary
embodiment of the system, method, and software arrangement according to the
present invention, an estimate of $\theta_j$ (call it $\hat{\theta}_j$) may be determined that takes into
account both the prior assumption and the data.

ESTIMATION OF $\theta_j$

a. N = 1

First, it is possible according to the present invention to obtain the
posterior distribution of $\theta_j$ from the prior $N(0, \tau^2)$ and the data. This exemplary
derivation can be done either from the Bayesian considerations, or via the James-Stein
Shrinkage estimators (See, e.g., James et al. ("James"), Proc. 4th Berkeley Symp.
Math. Statist. Vol. 1, 361-379 (1961); and Hoffman, Statistical Papers 41(2), 127-158
(2000), the disclosures of which are incorporated herein by reference in their
entireties). In this exemplary embodiment of the present invention, the Bayesian
estimators method can be applied, and it may initially be assumed that N = 1, i.e.,
there is one data point for each gene. Moreover, the variance can initially be denoted
by $\sigma^2$, such that:

$$X_j \sim N(\theta_j, \sigma^2) \tag{4}$$

$$\theta_j \sim N(0, \tau^2) \tag{5}$$



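For the sake of clarity, the probability density function (pdf) of $\theta_j$ can be denoted by
$\pi(\cdot)$, and the pdf of $X_j$ can be denoted by $f(\cdot)$. Based on equations (4) and (5), the
following relationships may be derived:

$$\pi(\theta_j) = \frac{1}{\sqrt{2\pi}\,\tau}\exp\left(-\theta_j^2/2\tau^2\right), \qquad f(X_j \mid \theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-(X_j - \theta_j)^2/2\sigma^2\right)$$

By Bayes' Rule, the joint pdf of $X_j$ and $\theta_j$ may be given by

$$f(X_j, \theta_j) = f(X_j \mid \theta_j)\,\pi(\theta_j) = \frac{1}{2\pi\sigma\tau}\exp\left(-\left[\frac{\theta_j^2}{2\tau^2} + \frac{(X_j - \theta_j)^2}{2\sigma^2}\right]\right) \tag{6}$$

Then $f(X_j)$, the marginal pdf of $X_j$, may be

$$f(X_j) = \int_{-\infty}^{\infty} f(X_j, \theta)\,d\theta = \frac{1}{\sqrt{2\pi}\sqrt{\sigma^2 + \tau^2}}\exp\left(-\frac{X_j^2}{2(\sigma^2 + \tau^2)}\right) \tag{7}$$

where the equality in equation (7) is written out in Appendix A.2. Based again on
Bayes' Theorem, the posterior distribution of $\theta_j$ may be given by:

$$\pi(\theta_j \mid X_j) = \frac{f(X_j \mid \theta_j)\,\pi(\theta_j)}{f(X_j)} = \frac{1}{\sqrt{2\pi}\sqrt{\frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}}}\exp\left(-\frac{\left(\theta_j - \frac{\tau^2}{\sigma^2 + \tau^2}X_j\right)^2}{2\,\frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}}\right) \tag{8}$$

(See Appendix A.3 for derivation of equation (8).)
Since this has a Normal form, it can be determined that:

$$E(\theta_j \mid X_j) = \frac{\tau^2}{\sigma^2 + \tau^2}X_j = \left(1 - \frac{\sigma^2}{\sigma^2 + \tau^2}\right)X_j, \qquad \mathrm{Var}(\theta_j \mid X_j) = \frac{\sigma^2\tau^2}{\sigma^2 + \tau^2} \tag{9}$$

$\theta_j$ then may be estimated by its mean.

b. N IS ARBITRARY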



In contrast to above where N was selected to be 1, if N is selected to be
arbitrary and greater than 1, $X_j$ becomes a vector $X_{\cdot j}$. It can be shown using
likelihood functions that the vector of values $\{X_{ij}\}_{i=1}^{N}$, with $X_{ij} \sim N(\theta_j, \beta^2)$, may be
treated as a single data point $Y_j = \bar{X}_{\cdot j} = \frac{1}{N}\sum_{i=1}^{N} X_{ij}$ from the distribution
$N(\theta_j, \beta^2/N)$ (see Appendix A.4). Thus, following the above derivation with $\sigma^2 =
\beta^2/N$, a Bayesian estimator for $\theta_j$ may be given by $E(\theta_j \mid X_{\cdot j})$:

$$E(\theta_j \mid X_{\cdot j}) = \left(1 - \frac{\beta^2/N}{\beta^2/N + \tau^2}\right)Y_j \tag{10}$$

However, equation (10) may likely not be directly used in equation (3) because $\tau^2$ and
$\beta^2$ may be unknown, such that $\tau^2$ and $\beta^2$ should be estimated from the data.
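
To make the behavior of equation (10) concrete, the following Python sketch evaluates the posterior mean for one gene under the idealized assumption that $\beta^2$ and $\tau^2$ are known. The function name `posterior_mean` and the numeric values are illustrative assumptions only.

```python
def posterior_mean(values, beta_sq, tau_sq):
    """Posterior mean of theta_j given N observations of a single gene.

    Assumes X_ij ~ N(theta_j, beta_sq) and the prior theta_j ~ N(0, tau_sq),
    with beta_sq and tau_sq treated as known (an idealization).
    """
    n = len(values)
    y = sum(values) / n                    # Y_j, the sample mean of the gene
    noise = beta_sq / n                    # variance of Y_j about theta_j
    shrink = noise / (noise + tau_sq)      # fraction of Y_j pulled back toward 0
    return (1.0 - shrink) * y

obs = [1.0, 2.0, 3.0, 2.0]                 # hypothetical observations, mean 2.0
print(posterior_mean(obs, beta_sq=4.0, tau_sq=1.0))    # intermediate shrinkage
print(posterior_mean(obs, beta_sq=4.0, tau_sq=0.0))    # prior is a point mass at 0
print(posterior_mean(obs, beta_sq=4.0, tau_sq=1e12))   # nearly flat prior
```

With $\tau^2 = 0$ the estimate collapses to 0, matching the Eisen choice, and as $\tau^2$ grows the estimate approaches the sample mean, matching the Pearson choice.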

c. ESTIMATION OF $1/(\beta^2/N + \tau^2)$

In this exemplary embodiment of the present invention, let

$$W = \frac{M - 2}{\sum_{j=1}^{M} Y_j^2} \tag{11}$$

This equation for $W$ is obtained from James-Stein estimation. $W$ may be treated as an
educated guess of an estimator for $1/(\beta^2/N + \tau^2)$, and it can be verified that $W$ is an
appropriate estimator for $1/(\beta^2/N + \tau^2)$, as follows:

$$Y_j \sim N(\theta_j, \beta^2/N) = \theta_j + N(0, \beta^2/N) = N(0, \tau^2) + N(0, \beta^2/N) = \sqrt{\tau^2}\,N(0,1) + \sqrt{\beta^2/N}\,N(0,1) = \sqrt{\beta^2/N + \tau^2}\,N(0,1) \tag{12}$$

The transition in equation (12) is set forth in Appendix A.5. If we let $\alpha^2 = \beta^2/N + \tau^2$, then
from equation (12) it follows that:

$$\frac{Y_j}{\alpha} \sim N(0,1), \qquad \frac{Y_j^2}{\alpha^2} \sim \chi_1^2,$$

and hence

$$\frac{1}{\alpha^2}\sum_{j=1}^{M} Y_j^2 = \sum_{j=1}^{M}\left(\frac{Y_j}{\alpha}\right)^2 = \chi_M^2,$$

where $\chi_M^2$ is a Chi-square random variable with $M$ degrees of freedom. By
properties of the Chi-square distribution and the linearity of expectation,

$$E\left(\frac{\alpha^2}{\sum_{j=1}^{M} Y_j^2}\right) = \frac{1}{M - 2} \quad \text{(see Appendix A.6)}$$

Thus, $W$ is an unbiased estimator of $1/(\beta^2/N + \tau^2)$, and can be used to replace
$1/(\beta^2/N + \tau^2)$ in equation (10).
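
The unbiasedness claim can be checked numerically. The sketch below is a Monte Carlo illustration only, not part of the disclosed method: it draws $Y_j \sim N(0, \alpha^2)$ repeatedly and averages $W$. The helper names `james_stein_w` and `mean_w`, the seed, and the sample sizes are all assumptions chosen for the example.

```python
import random

def james_stein_w(y):
    """W = (M - 2) / sum_j Y_j^2, as in equation (11)."""
    m = len(y)
    return (m - 2) / sum(v * v for v in y)

def mean_w(alpha_sq, m, trials, seed=0):
    """Average W over repeated simulated draws Y_j ~ N(0, alpha_sq)."""
    rng = random.Random(seed)              # fixed seed for reproducibility
    alpha = alpha_sq ** 0.5
    total = 0.0
    for _ in range(trials):
        y = [rng.gauss(0.0, alpha) for _ in range(m)]
        total += james_stein_w(y)
    return total / trials

estimate = mean_w(alpha_sq=2.0, m=50, trials=2000)
print(estimate, 1.0 / 2.0)                 # simulated mean of W versus 1 / alpha^2
```

The simulated average should lie close to $1/\alpha^2$; the agreement is statistical, so it tightens as the number of trials grows.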



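d. ESTIMATION OF $\beta^2$

It can be shown (e.g., see Appendix A.7) that:

$$S_j^2 = \frac{1}{N - 1}\sum_{i=1}^{N}\left(X_{ij} - Y_j\right)^2$$

is an unbiased estimator for $\beta^2$ based on the data from gene $j$, and that $(N-1)S_j^2/\beta^2$ has
a Chi-square distribution with $(N-1)$ degrees of freedom. Since this is
the case for every $j$, a more accurate estimate for $\beta^2$ is obtained by pooling all
available data, i.e., by averaging the estimates for each $j$:

$$\hat{\beta}^2 = \frac{1}{M}\sum_{j=1}^{M} S_j^2 = \frac{1}{M}\sum_{j=1}^{M}\left(\frac{1}{N - 1}\sum_{i=1}^{N}\left(X_{ij} - Y_j\right)^2\right) \tag{13}$$

$\hat{\beta}^2$ may be an unbiased estimator for $\beta^2$, because

$$E(\hat{\beta}^2) = E\left(\frac{1}{M}\sum_{j=1}^{M} S_j^2\right) = \frac{1}{M}\sum_{j=1}^{M} E(S_j^2) = \beta^2.$$

Substituting the estimates (11) and (13) into equation (10), an explicit estimate for $\theta_j$
may be obtained:

$$\hat{\theta}_j = \left(1 - W\cdot\frac{\hat{\beta}^2}{N}\right)Y_j = \left(1 - \frac{M - 2}{\sum_{k=1}^{M} Y_k^2}\cdot\frac{1}{MN(N - 1)}\sum_{k=1}^{M}\sum_{i=1}^{N}\left(X_{ik} - Y_k\right)^2\right)Y_j \tag{14}$$

Further, $\hat{\theta}_j$ from equation (14) may be substituted into the correlation coefficient in
equation (3) wherever $(X_j)_{\text{offset}}$ appears to obtain an explicit formula for $S(X_{\cdot j}, X_{\cdot k})$.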
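
The estimators of equations (11), (13), and (14) can be sketched in a few lines of Python. The function name `shrinkage_estimates`, the row-per-gene layout of `data`, and the tiny sample matrix are assumptions made for illustration; the quantity called `gamma` here is the factor $1 - W\hat{\beta}^2/N$ multiplying $Y_j$ in equation (14).

```python
def shrinkage_estimates(data):
    """Return (W, beta_hat_sq, gamma, theta_hat) for a list of gene vectors.

    data[j][i] is the value of gene j under condition i, so there are
    M = len(data) genes and N = len(data[0]) conditions.
    """
    m, n = len(data), len(data[0])
    if m < 3 or n < 2:
        raise ValueError("need at least 3 genes and 2 conditions")
    y = [sum(row) / n for row in data]                    # Y_j, per-gene means
    w = (m - 2) / sum(v * v for v in y)                   # equation (11)
    pooled_ss = sum((x - yj) ** 2 for row, yj in zip(data, y) for x in row)
    beta_hat_sq = pooled_ss / (m * (n - 1))               # equation (13)
    gamma = 1.0 - w * beta_hat_sq / n                     # shrinkage factor
    theta_hat = [gamma * yj for yj in y]                  # equation (14)
    return w, beta_hat_sq, gamma, theta_hat

# Hypothetical data: M = 3 genes observed under N = 2 conditions.
data = [[1.0, 3.0], [0.0, 2.0], [-1.0, 1.0]]
w, beta_hat_sq, gamma, theta_hat = shrinkage_estimates(data)
print(w, beta_hat_sq, gamma, theta_hat)
```

Working this example by hand gives $W = 1/5$, $\hat{\beta}^2 = 2$, and a factor of 0.8, so each gene mean is pulled 20% of the way toward zero.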
CLUSTERING

In an exemplary embodiment of the present invention, the genes may be clustered using the same hierarchical clustering algorithm as used by Eisen, except that G_offset is set equal to γḠ, where γ is a value between 0.0 and 1.0. The hierarchical clustering algorithm used by Eisen is based on the centroid-linkage method, which is referred to as "an average-linkage method" described in Sokal et al. ("Sokal"), Univ. Kans. Sci. Bull. 38, 1409-1438 (1958), the disclosure of which is incorporated herein by reference in its entirety. This method may compute a binary tree (dendrogram) that assembles all the genes at the leaves of the tree, with each internal node representing possible clusters at different levels. For any set of M genes, an upper-triangular similarity matrix may be computed by using a similarity metric of the type described in Eisen, which contains similarity scores for all pairs of genes. A node can be created joining the most similar pair of genes, and a gene expression profile can be computed for the node by averaging observations for the joined genes. The similarity matrix may be updated with such new node replacing the two joined elements, and the process may be repeated (M - 1) times until a single element remains. Because each internal node can be labeled by a value representing the similarity between its two children nodes (i.e., the two elements that were combined to create the internal node), a set of clusters may be created by breaking the tree into subtrees (e.g., by eliminating the internal nodes with labels below a certain predetermined threshold value). The clusters created in this manner can be used to compare the effects of choosing differing similarity measures.
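The step of breaking the tree into subtrees may be illustrated by the following minimal Python sketch. It is offered only as an illustration under stated assumptions: the `Node` class, its field names, and the function `cut_tree` are hypothetical and do not appear in the source, and a gene that ends up alone is reported as a singleton cluster.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical dendrogram node: a leaf holds a gene name; an internal
    node holds two children and the similarity at which they were joined."""
    gene: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    similarity: float = 1.0


def cut_tree(root: Node, threshold: float) -> List[List[str]]:
    """Eliminate internal nodes labeled below `threshold`; each remaining
    connected piece of the tree becomes one cluster of gene names."""
    clusters: List[List[str]] = []

    def visit(node: Node) -> List[str]:
        # Returns the genes still attached to `node` after the cut.
        if node.gene is not None:
            return [node.gene]
        parts = [visit(node.left), visit(node.right)]
        if node.similarity >= threshold:
            return parts[0] + parts[1]
        # This node is eliminated: each child piece becomes its own cluster.
        clusters.extend(p for p in parts if p)
        return []

    remainder = visit(root)
    if remainder:
        clusters.append(remainder)
    return clusters
```

For example, a tree joining (A, B) at 0.9 and (C, D) at 0.7, with the two pairs joined at 0.3, yields the clusters {A, B} and {C, D} at a threshold of 0.60.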

III. ALGORITHM & IMPLEMENTATION

An exemplary implementation of a hierarchical clustering can proceed by selecting the most similar pair of elements (starting with genes at the bottom-most level) and combining them to create a new element. The "expression vector" for the new element can be the weighted average of the expression vectors of the two most similar elements that were combined. This exemplary structure of repeated pair-wise combinations may be represented in a binary tree, whose leaves can be the set of genes, and whose internal nodes can be the elements constructed from the two children nodes. The exemplary algorithm according to the present invention is described below in pseudocode.

HIERARCHICAL CLUSTERING PSEUDOCODE

Switch:
    Pearson: γ = 1;
    Eisen: γ = 0;
    Shrinkage: {
        Compute W = (M - 2)
        Compute β² = Σ_{j=1..M} Σ_{i=1..N} (X_ij - X̄_·j)² / (M(N - 1))
        γ = 1 - W · β² / N
    }

While (# clusters > 1) do
    • Compute similarity table:
          S(G_j, G_k) = Σ_i (G_ij - (G_j)_offset)(G_ik - (G_k)_offset)
                        / sqrt( Σ_i (G_ij - (G_j)_offset)² · Σ_i (G_ik - (G_k)_offset)² )
      where (G_l)_offset = γ Ḡ_l.
    • Find (j*, k*):
          S(G_j*, G_k*) ≥ S(G_j, G_k) ∀ clusters j, k
    • Create new cluster N_j*k* = weighted average of G_j* and G_k*.
    • Take out clusters j* and k*.
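The loop in the pseudocode above may be sketched in Python as follows. This is a minimal illustration, not the implementation of the present invention: the function names, the naming of merged elements, and the choice to weight the average by the number of genes in each element are assumptions, and γ is passed in by the caller rather than estimated. The similarity is undefined when a vector equals its offset in every experiment (zero denominator), which the sketch does not guard against.

```python
import math
from typing import Dict, List, Tuple


def similarity(g: List[float], h: List[float], gamma: float) -> float:
    """S(G_j, G_k) with offsets gamma * mean (gamma = 1: Pearson, 0: Eisen)."""
    og = gamma * sum(g) / len(g)
    oh = gamma * sum(h) / len(h)
    num = sum((a - og) * (b - oh) for a, b in zip(g, h))
    den = math.sqrt(sum((a - og) ** 2 for a in g) * sum((b - oh) ** 2 for b in h))
    return num / den


def cluster(genes: Dict[str, List[float]], gamma: float) -> List[Tuple[str, str, float]]:
    """Repeatedly join the most similar pair; returns the merges in order."""
    # name -> (expression vector, number of genes contained in the element)
    active = {name: (list(vec), 1) for name, vec in genes.items()}
    merges: List[Tuple[str, str, float]] = []
    while len(active) > 1:
        names = sorted(active)
        pairs = [
            (similarity(active[a][0], active[b][0], gamma), a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
        ]
        s, a, b = max(pairs)  # most similar pair (j*, k*)
        (va, na), (vb, nb) = active.pop(a), active.pop(b)
        # Weighted average of the two vectors; weights are element sizes (assumed).
        merged = [(na * x + nb * y) / (na + nb) for x, y in zip(va, vb)]
        active[f"({a},{b})"] = (merged, na + nb)
        merges.append((a, b, s))
    return merges
```

Each returned triple records the two joined elements and the similarity labeling the new internal node, so M genes produce M - 1 merges.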
IV. MATHEMATICAL SIMULATIONS AND EXAMPLES

a. IN SILICO EXPERIMENT

To compare the performance of these exemplary algorithms, it is possible to conduct an in silico experiment. In such an experiment, two genes X and Y can be created, and N (about 100) experiments can be simulated, as follows:

\[ X_i = \theta_X + \sigma_X\left(\alpha_i(X,Y) + \mathcal{N}(0,1)\right), \text{ and} \]
\[ Y_i = \theta_Y + \sigma_Y\left(\alpha_i(X,Y) + \mathcal{N}(0,1)\right), \]

where α_i, chosen from a uniform distribution over a range [L, H] (U(L, H)), can be a "bias term" introducing a correlation (or none if all α's are zero) between X and Y. θ_X ~ N(0, τ²) and θ_Y ~ N(0, τ²) are the means of X and Y, respectively. Similarly, σ_X and σ_Y are the standard deviations for X and Y, respectively.

With this model,

\[ S(X,Y) = \frac{1}{N}\sum_{i=1}^{N} \frac{(X_i - \theta_X)}{\sigma_X}\,\frac{(Y_i - \theta_Y)}{\sigma_Y} \sim \frac{1}{N}\sum_{i=1}^{N} \left(\alpha_i + \mathcal{N}(0,1)\right)\left(\alpha_i + \mathcal{N}(0,1)\right) \]

if the exact values of the mean and variance are used. The distribution of S is denoted by F(μ, δ), where μ is the mean and δ is the standard deviation.

The model was implemented in Mathematica (See Wolfram ("Wolfram"), The Mathematica Book. Cambridge University Press, 4th Ed. (1999), the disclosure of which is incorporated herein by reference in its entirety). The following parameters were used in the simulation: N = 100, τ ∈ {0.1, 10.0} (representing very low or high variability among the genes), σ_X = σ_Y = 10.0, and α = 0 representing no correlation between the genes or α ~ U(0, 1) representing some correlation between the genes. Once the parameters were fixed for a particular in silico experiment, the gene-expression vectors for X and Y were generated several thousand times, and for each pair of vectors S_c(X, Y), S_p(X, Y), S_e(X, Y), and S_s(X, Y) were estimated by four different algorithms and further examined to see how the estimators of S varied over these trials. These four different algorithms estimated S according to equations (1) and (2), as follows: Clairvoyant estimated S_c using the true values of θ_X, θ_Y, σ_X and σ_Y; Pearson estimated S_p using the unbiased estimators X̄ and Ȳ of θ_X and θ_Y (for X_offset and Y_offset), respectively; Eisen estimated S_e using the value 0.0 as the estimator of both θ_X and θ_Y; and Shrinkage estimated S_s using the shrunk biased estimators θ̂_X and θ̂_Y of θ_X and θ_Y, respectively. In the latter three, the standard deviation was estimated as in equation (2). The histograms corresponding to these in silico experiments can be found in Figure 4 (See Below). The information obtained from these conducted simulations is as follows:

When X and Y are not correlated and the noise in the input is low (N = 100, τ = 0.1, and α = 0), Pearson performs about the same as Eisen, Shrinkage, and Clairvoyant (S_c ~ F(-0.000297, 0.0996), S_p ~ F(-0.000269, 0.0999), S_e ~ F(-0.000254, 0.0994), and S_s ~ F(-0.000254, 0.0994)).

When X and Y are not correlated, but the noise in the input is high (N = 100, τ = 10.0, and α = 0), Pearson performs about as well as Shrinkage and Clairvoyant, but Eisen introduces a substantial number of false-positives (S_c ~ F(-0.000971, 0.0994), S_p ~ F(-0.000939, 0.100), S_e ~ F(-0.00119, 0.354), and S_s ~ F(-0.000939, 0.100)).

When X and Y are correlated and the noise in the input is low (N = 100, τ = 0.1, and α ~ U(0,1)), Pearson performs substantially worse than Eisen, Shrinkage, and Clairvoyant, and Eisen, Shrinkage, and Clairvoyant perform about equally as well. Pearson introduces a substantial number of false-negatives (S_c ~ F(0.331, 0.132), S_p ~ F(0.0755, 0.0992), S_e ~ F(0.248, 0.0915), and S_s ~ F(0.245, 0.0915)).

Finally, when X and Y are correlated and the noise in the input is high, the signal-to-noise ratio becomes extremely poor regardless of the algorithm employed (S_c ~ F(0.333, 0.133), S_p ~ F(0.0762, 0.100), S_e ~ F(0.117, 0.368), and S_s ~ F(0.0762, 0.0999)).

In summary, Pearson tends to introduce more false negatives and Eisen tends to introduce more false positives than Shrinkage. Exemplary Shrinkage procedures according to the present invention, on the other hand, can reduce these errors by combining the positive properties of both algorithms.
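The in silico experiment may be approximated with the following Python sketch, which is not the Mathematica model referred to above. It covers only the Clairvoyant, Pearson, and Eisen estimators; the Shrinkage estimator is omitted because it depends on estimates defined elsewhere in this description. The function names, the default of 2000 trials, and the fixed seed are assumptions made for illustration.

```python
import math
import random
import statistics


def offset_similarity(x, y, x_off, y_off):
    """Correlation-style similarity of x and y about the given offsets."""
    dx = [a - x_off for a in x]
    dy = [b - y_off for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    return num / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def simulate(tau, correlated, n=100, sigma=10.0, trials=2000, seed=0):
    """Return {estimator: (mean, standard deviation)} of S over the trials."""
    rng = random.Random(seed)
    scores = {"clairvoyant": [], "pearson": [], "eisen": []}
    for _ in range(trials):
        theta_x, theta_y = rng.gauss(0.0, tau), rng.gauss(0.0, tau)
        # Shared bias term alpha_i ~ U(0, 1), or zero for uncorrelated genes.
        alpha = [rng.uniform(0.0, 1.0) if correlated else 0.0 for _ in range(n)]
        x = [theta_x + sigma * (a + rng.gauss(0.0, 1.0)) for a in alpha]
        y = [theta_y + sigma * (a + rng.gauss(0.0, 1.0)) for a in alpha]
        # Clairvoyant: true means and true standard deviations.
        scores["clairvoyant"].append(
            sum((a - theta_x) * (b - theta_y) for a, b in zip(x, y))
            / (n * sigma * sigma)
        )
        # Pearson: sample means as offsets. Eisen: zero offsets.
        scores["pearson"].append(offset_similarity(x, y, sum(x) / n, sum(y) / n))
        scores["eisen"].append(offset_similarity(x, y, 0.0, 0.0))
    return {k: (statistics.mean(v), statistics.pstdev(v)) for k, v in scores.items()}
```

Calling `simulate(tau=10.0, correlated=False)` corresponds to the uncorrelated, high-noise case, and `simulate(tau=0.1, correlated=True)` to the correlated, low-noise case; the spread of the Eisen estimate in the former and the low mean of the Pearson estimate in the latter can then be compared with the values reported above.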

b. BIOLOGICAL EXAMPLE 

Exemplary algorithms also were tested on a biological example. A biologically well-characterized system was selected, and the clusters of genes involved in the yeast cell cycle were analyzed. These clusters were computed using the hierarchical clustering algorithm with the underlying similarity measure chosen from the following three: Pearson, Eisen, or Shrinkage. As a reference, the computed clusters were compared to the ones implied by the common cell-cycle functions and regulatory systems inferred from the roles of various transcriptional activators (See description associated with Figure 5 below).

The experimental analysis was based on the assumption that the groupings suggested by the ChIP (Chromatin ImmunoPrecipitation) analysis are correct and thus provide a direct approach to compare various correlation coefficients. It is possible that the ChIP-based groupings themselves contain several false relations (both positives and negatives). Nevertheless, the trend of reduced false positives and false negatives using shrinkage analysis appears to be consistent with the mathematical simulation set forth above.

In Simon et al. ("Simon"), Cell 106, 697-708 (2001), the disclosure of which is incorporated herein by reference in its entirety, genome-wide location analysis is used to determine how the yeast cell cycle gene expression program is regulated by each of the nine known cell cycle transcriptional activators: Ace2, Fkh1, Fkh2, Mbp1, Mcm1, Ndd1, Swi4, Swi5, and Swi6. It was also determined that cell cycle transcriptional activators which function during one stage of the cell cycle regulate transcriptional activators that function during the next stage. According to an exemplary embodiment of the present invention, these serial regulation transcriptional activators, together with various functional properties, can be used to partition some selected cell cycle genes into nine clusters, each one characterized by a group of transcriptional activators working together and their functions (see Table 1). For example, Group 1 may be characterized by the activators Swi4 and Swi6 and the function of budding; Group 2 may be characterized by the activators Swi6 and Mbp1 and the function involving DNA replication and repair at the juncture of G1 and S phases, etc.

The hypothesis in this exemplary embodiment of the present invention can be summarized as follows: genes expressed during the same cell cycle stage (and regulated by the same transcriptional activators) can be in the same cluster. Provided below are exemplary deviations from this hypothesis that are observed in the raw data.

Possible False Positives:

Bud9 (Group 1: Budding) and {Cts1, Egt2} (Group 7: Cytokinesis) can be placed in the same cluster by all three metrics (clusters P49 = S82, and E47); however, the Eisen metric also places Exg1 (Group 1) and Cdc6 (Group 8: Pre-replication complex formation) in the same cluster.

Mcm2 (Group 2: DNA replication and repair) and Mcm3 (Group 8) can be placed in the same cluster by all three metrics (clusters P10 = S20, and E73); however, the Eisen metric places several more genes from different groups in the same cluster: {Rnr1, Rad27, Cdc21, Dun1, Cdc45} (Group 2), Hta3 (Group 3: Chromatin), and Mcm6 (Group 8) are also placed in cluster E73.

Table 1: Genes in our data set, grouped by transcriptional activators and cell-cycle functions.

Group | Activators       | Genes                                                                      | Functions
1     | Swi4, Swi6       | Cln1, Cln2, Gic1, Gic2, Msb2, Rsr1, Bud9, Mnn1, Och1, Exg1, Kre6, Cwp1     | Budding
2     | Swi6, Mbp1       | Clb5, Clb6, Rnr1, Rad27, Cdc21, Dun1, Rad51, Cdc45, Mcm2                   | DNA replication and repair
3     | Swi4, Swi6       | Htb1, Htb2, Hta1, Hta2, Hta3, Hho1                                         | Chromatin
4     | Fkh1             | Hhf1, Hht1, Tel2, Arp7                                                     | Chromatin
5     | Fkh1             | Tem1                                                                       | Mitosis Control
6     | Ndd1, Fkh2, Mcm1 | Clb2, Ace2, Swi5, Cdc20                                                    | Mitosis Control
7     | Ace2, Swi5       | Cts1, Egt2                                                                 | Cytokinesis
8     | Mcm1             | Mcm3, Mcm6, Cdc6, Cdc46                                                    | Pre-replication complex formation
9     | Mcm1             | Ste2, Far1                                                                 | Mating

Possible False Negatives:

Group 1: Budding (Table 1) may be split into four clusters by the Eisen metric: {Cln1, Cln2, Gic2, Rsr1, Mnn1} ∈ Cluster a (E39), Gic1 ∈ Cluster b (E62), {Bud9, Exg1} ∈ Cluster c (E47), and {Kre6, Cwp1} ∈ Cluster d (E66); and into six clusters by both the Shrinkage and Pearson metrics: {Cln1, Cln2, Gic2, Rsr1, Mnn1} ∈ Cluster a (S3=P66), {Gic1, Kre6} ∈ Cluster b (S39=P17), Msb2 ∈ Cluster c (S24=P71), Bud9 ∈ Cluster d (S82=P49), Exg1 ∈ Cluster e (S48=P78), and Cwp1 ∈ Cluster f (S8=P4).

Table 1 contains those genes from Figure 5 that were present in an evaluated data set. The following tables contain these genes grouped into clusters by an exemplary hierarchical clustering algorithm according to the present invention using the three metrics (Eisen in Table 2, Pearson in Table 3, and Shrinkage in Table 4), thresholded at a correlation coefficient value of 0.60. The choice of the threshold parameter is discussed further below. Genes that have not been grouped with any others at a similarity of 0.60 or higher are not included in the tables. In the subsequent analysis they can be treated as singleton clusters.

Table 2: Eisen Clusters

Cluster     | Activators     | Genes
E39         | Swi4/Swi6      | Cln1, Cln2, Gic2, Rsr1, Mnn1
E62         | Swi4/Swi6      | Gic1
E47         | Swi4/Swi6      | Bud9, Exg1
            | Ace2/Swi5      | Cts1, Egt2
            | Mcm1           | Cdc6
E66         | Swi4/Swi6      | Kre6, Cwp1
(illegible) | Swi6/Mbp1      | Clb5, Clb6, Rad51
            | Fkh1           | (illegible)
            | Ndd1/Fkh2/Mcm1 | Cdc20
            | Mcm1           | Cdc46
E73         | Swi6/Mbp1      | Rnr1, Rad27, Cdc21, Dun1, Cdc45, Mcm2
            | Swi4/Swi6      | Hta3
            | Mcm1           | Mcm3, Mcm6
E63         | Swi4/Swi6      | Htb1, Htb2, Hta1, Hta2, Hho1
            | Fkh1           | Hhf1, Hht1
(illegible) | Fkh1           | Arp7
(illegible) | Fkh1           | Tem1
            | Ndd1/Fkh2/Mcm1 | Clb2, Swi5
(illegible) | Mcm1           | Ste2, Far1




WO 2004/097577 



PCT/US2004/012921 



Table 3: Pearson Clusters

Cluster     | Activators     | Genes
P66         | Swi4/Swi6      | Cln1, Cln2, Gic2, Rsr1, Mnn1
P17         | Swi4/Swi6      | Gic1, Kre6
P71         | Swi4/Swi6      | Msb2
P49         | Swi4/Swi6      | Bud9
            | Ace2/Swi5      | Cts1, Egt2
P78         | Swi4/Swi6      | Exg1
P4          | Swi4/Swi6      | Cwp1
P12         | Swi6/Mbp1      | Clb5, Clb6, Rnr1, Cdc21, Dun1, Rad51, Cdc45
            | Swi4/Swi6      | Hta3
            | Fkh1           | Tel2
            | Ndd1/Fkh2/Mcm1 | Cdc20
            | Mcm1           | Mcm6, Cdc46
P10         | Swi6/Mbp1      | Mcm2
            | Mcm1           | Mcm3
P54         | Swi4/Swi6      | Htb1, Htb2, Hta1, Hta2, Hho1
            | Fkh1           | Hhf1, Hht1
(illegible) | Fkh1           | Arp7
(illegible) | Ndd1/Fkh2/Mcm1 | Clb2, Ace2, Swi5
P50         | Mcm1           | Ste2, Far1
Table 4: Shrinkage Clusters

Cluster     | Activators     | Genes
S3          | Swi4/Swi6      | Cln1, Cln2, Gic2, Rsr1, Mnn1
S39         | Swi4/Swi6      | Gic1, Kre6
S24         | Swi4/Swi6      | Msb2
S82         | Swi4/Swi6      | Bud9
            | Ace2/Swi5      | Cts1, Egt2
S48         | Swi4/Swi6      | Exg1
S8          | Swi4/Swi6      | Cwp1
S14         | Swi6/Mbp1      | Clb5, Clb6, Rnr1, Cdc21, Dun1, Rad51, Cdc45
            | Fkh1           | Tel2
            | Ndd1/Fkh2/Mcm1 | Cdc20
            | Mcm1           | Mcm6, Cdc46
S20         | Swi6/Mbp1      | Mcm2
            | Mcm1           | Mcm3
S4          | Swi4/Swi6      | Htb1, Htb2, Hta1, Hta2, Hho1
            | Fkh1           | Hhf1, Hht1
S13         | Swi4/Swi6      | (illegible)
S63         | Fkh1           | Arp7
S22         | Ndd1/Fkh2/Mcm1 | Clb2, Ace2, Swi5
(illegible) | Mcm1           | Ste2, Far1



The value γ = 0.89 estimated from the raw yeast data appears to be greater than a γ value based on equation [1]. Moreover, the value γ = 0 performed better than γ = 1. The estimated value also appears not to have yielded as great an improvement in the yeast data clusters as the simulations indicated. This exemplary result indicates that the true value of γ may be closer to 0. Upon a closer examination of the data, it can be observed that it may be possible that the data in its raw "pre-normalized" form is inconsistent with the assumptions used in deriving γ:

1. The gene vectors are not range-normalized, so β_j² ≠ β² for every j; and
2. The N experiments are not necessarily independent.



CORRECTIONS

The first observation may be compensated for by normalizing all gene vectors with respect to range (dividing each entry in gene X by (X_max - X_min)), recomputing the estimated γ value, and repeating the clustering process. As the estimate γ = 0.91 yielded by the normalized gene expression data still appears to be too high a value, an extensive computational experiment was conducted to determine the best empirical γ value by also clustering with the shrinkage factors of 0.2, 0.4, 0.6, and 0.8. The clusters taken at the correlation factor cut-off of 0.60, as above, are presented in Tables 5, 6, 7, 8, 9, 10 and 11.
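The range normalization described above may be sketched as follows. The function name and the handling of a constant gene vector (for which the range is zero and the normalization is undefined) are assumptions made for illustration only.

```python
from typing import List


def range_normalize(gene: List[float]) -> List[float]:
    """Divide every entry of a gene vector by (max - min) of that vector."""
    spread = max(gene) - min(gene)
    if spread == 0:
        # Assumed behavior: a constant vector cannot be range-normalized.
        raise ValueError("constant gene vector has zero range")
    return [x / spread for x in gene]
```

After this step every gene vector spans a range of exactly 1, which is the condition the first observation above found lacking in the raw data.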



Table 5: RN Data, γ = 0.0 (Eisen Clusters)



ES 


Swii/SwiG 


Gbil^Msba.Mxml 


all 




uiu2, Jtsrl 




BwiS/Mbpl 


ClbS, CluG, Rnrl, Rad37> Cdc!2J, 
Dunl, RariSl, Ctie-ia 






HUi3 




Pklil 


Tel2 




Nddl/Mtf/Mcml 


Cde20 




Mcrnl 


Mcmfi, QticAG 


El-i 


Sra.4/5wi6 


Gicl 


KIT 


Swii/Swifl 


Eud9 




Acc2/Bwi5 


GCKl,Ksti2 




Mcrnl 


Ste2. Fhrt 


1516 


SwM/Swi6 


Exgl 


H59 


S\vi4/Swi6 




eis 


Svifi/Mbpl 


Mcrn2 






Mcm3 


E86 


SvM/Snifi 


Hrtil, Hfcb2, Hurt, Hfcn2, Htal 




Pklil 


HhEl, Hhfcl 


EtO 


Fldil 


ArnT 


RID 


Ffclil 


Teml 




NddiyPkM/Mciiil 


Clb2, Acq2, SwiS 


Ell 


Mcml 


Cdcfi 



Table 6: Range-normalized data, γ = 0.2





SwM/SviG 


Clnl. Gic2. HetI. Mnnl 




Swi»i/S\vi6 


Chi 2 




SvriG/Munl 


QbG Rnrt RadS? Odc2I Diuii 






RadSl, Cdc/15 


£02 23 


Swi4/Swi6 

w 


Gicl 




SwU/SwiG 


Bud9 




Acc2/Ssvifi 


Gtsl t Bgt2 




SvU/5n£6 


Exgl 




FMil. 


ArnF 




Swi<l/SwiG 




Sa.aiS 


Swi6/Mbpl 


ClbS 




SvvM/SviG 


Hta3 




Pkht 


Tola 




Ntkn/Fkn2/Mant 


Cdc20 




Maul 


Mcmfi, CdcM 




SwiG/Mupi 






Mem 1 


McinS 


So. 2 25 




HtbJ, Htb2. Uca.1, Htn2 t Hhol 




Pkht 


HhEl, llhtl 


Sfc.229 


Fkht 


T Vcml 




Nddl/FlehS/Meitil 


db3 r Acc2 ( Sw& 


&h2 f * 


Mctnl 


Ste2 


So.2o5 


Mcrnl 


Far1

Table 7: Range-normalized data, γ = 0.4





Swaa/SwiG 


Clnl, Gic2„ Rsrl„ Muni 






CIn3 




Swifi/Mbpl 


ClhS, CUbfi. Hurl, Rad27 r Cdf 21. 






Dunl, Radal, Cdc4a 






Htii3 




Fttnl 


Tq13 




\f IJ.1 f I Jit A 

Nddl/Fkh2/Mcml 


Cdc'lQ 






AlcmG, Odc46 




a,„; ,i >o~;p 


oicl, ftrce 






Msb2 


3a,HG 




Bud9 




Aoa2/Svxri5 


Ctsl, Egt2 




Swi4/Swifi 


Ex*t 


S d< j2 


SwiG/Mbpl 


Mcm2 




Mcmi 


Mem3 


So ,4*13 




Htbl, Ht.b2 t Hml, Hta2 s Mhol 




Fwa 


Hhfl, Hhfcl 




Fkhl 


Arp7 




Fkhl 


Toml 




Nddl/FkJQ/Mcml 


Clb2, Ao22, SwiS 


Su4 16 


Mcml 


GdeG 




Mcml 


SteG 




Mcm1

Table 8: Range-normalized data, γ = 0.6



Sbjs34 


SwM/SndG 


Clnt r Gic2, RftI, Mnul 




5wi4/Swifi 


Cln2 




SwlG/Mbpl 


ClbS, Clbfi, Rnri, ftacttS', Cdc21, 






Duul, RjuISI, OdcdS 




SwrW/Swi6 


Htu2 




Pklil 


TW2 




NdcU/Fkh2/Mdul 


Cdc20 




Manl 


Mcnifi, Cdc46 


C OK 




Gicl, Krcfi 




SwW/SwiG 


Meb2 




Swd4/5wifi 


Bud9 




Ace2/Svi5 


Ctsl, E£& ! 


WO 


SwrM/SwiG 


Exgl 




SwiG/Mbpl 


Mcm2 




Maul 


Mnm3 




5vi4/SwiG 


Hthl, Hfcb2, Htai, Hta2, Hhal 




Pfclil 


HJifJ, Hljld 




Fklii 


Arp7 


Sq q3T 


Nddl/Fkh2/Manl. 


Clb2 P AcD2 s SivB 




Mcml 


Sic2 




Mcm1

Table 9: Range-normalized data, γ = 0.8



Sd.kS1 


SwW/SwiG 


Clni.GicO, Rsrl, Miml 




Swi4/Swi6 


Obi2 




SwiB/Mbpl 


C3hS, CJbfi, Rurl, Rad2r, 0^21, 






Dun I, RndSl, Cdc43 




aWl4/SWMa 


Hta.3 




'ni.i.ii 


Tel2 




« dal /Fku2 / fclcml 


Gdc2Q 




fiicm 1 


A'lemB;, Cdc4G 




Sift la/bWJti 


Gicl. KreB 


Sn.sSO 


SwM/SwiB 


Mst?2 


So.fi31 


SwM/SwIS 


Bud9 




Aca2/SvviS 


CUil t Eg:2 




SwU/SwiG 






Swi4/SwiG 


Cvvpl 




Swifi/Mbpl 


Man 2 




Haul 


Man3 


So.sl* 


SwM/SwiG 


Htbl, H!b2, Htat, 11^2, Hhol 




Fkltl 


Hhfl, Hhtl 




Fklil 


Arp? 


Sn.sM 


NtIdl/Pkti2AMcni1 


CIb2, AogQ, SiviS 


Sg .g33 


Mem 1 






Meml 


Far1

Table 10: RN Data, γ = 0.91 (Shrinkage Clusters)



549 


5wi4/Swifi 


Glut. Gic2, RjstI, Muni 


B73 


BwI4/Swifi 


Cln2 




Swie/Mbpl 


Clb5 r ClbG, Rnrl, Hjid27, Cdc21, 






Dunl\nnd51, Octets 






Hta3 




Pkiil 


HeI2 




Nddl/PkUa/Mcrat 


Cdc20 




Mcml 


MemG, Cde4G 




SwM/Stfi6 


Gici, Kig6 






MfiDJ 


590 


SwU/SaiG 


BudD 




Ace2/Swifi 


Ctsl, Egt2 


556 


Swi4/SwiS 


Exgl 


SJ6 


Swtd/5vlfi 


Cwpl 


SV1 


SvrlS/Mbpl 


Mcm2 






Mcjti3 


951 


Swi4/Svifi 


HtM, Htb2> Htal, Hta2, Hhol 




FkhL 


HhEl, Hltul 


ssr 


Bdil 


Arp7 


57 


'Nddl/PWi2/Mcml 


0^2, Aco2, Swla 




Mem I 


Sta2 


£92 


ft-ternl 


Fail 






Table 11: RN Data, γ = 1.0 (Pearson Clusters)



P10 


SivU/SmB 


Glut, Gic2, iterl, Muni 


P6S 


Bwi4/Swi"G 


Cln2 




Bwifi/Mbpl 

> * 


GlbS, ClbB, Rnrl, Rad2T, Ctlc21, 






Dun^RidSl, CckrtS 






Hta3 




Fkht 


WJ 






Cdc20 




Mcrn J 


McmC, Cdcdfi 


Pi 


Swi4/5wi6 


Gicl, KreG 


O on 


5wi 4/SwiB 


Mfib2 


PG6 


Swi4/Srci6 






Ace2/Svri5 


Ctfil, Egfc2 


P2Q 


S»i4/Swi6 


Exgl 


F2 


&vL-d/Swi6 


Crcpl 


PT2 


SwL6/Mhpl. 


Mem2 




Man 1 


Maid 


P53 


5wi«i/5wifi 


tffcbl, Htb2, Htnl, Hta2, Ilto! 






HbEl, Uhiil 


P12 


Fkhi 


ArpV 


P46 


NdcJJ /FMi2/Mcml 


(352, Aea2. SwiS 


PGI 


Maul 




PG5 


Mcm1

To compare the resulting sets of clusters, the following notation may be introduced. Each cluster set may be written as follows:

\[ \gamma \Longrightarrow \left\{ x \rightarrow \{\{y_1, z_1\}, \{y_2, z_2\}, \ldots, \{y_{n_x}, z_{n_x}\}\} \right\}_{x=1}^{\#\text{ of groups}} \]

where x denotes the group number (as described in Table 1), n_x is the number of clusters group x appears in, and for each cluster j ∈ {1, ..., n_x}, there are y_j genes from group x and z_j genes from other groups in Table 1. A value of "*" for z_j denotes that cluster j contains additional genes, although none of them are cell cycle genes; in subsequent computations, this value may be treated as 0.

This notation naturally lends itself to a scoring function for measuring the number of false positives, number of false negatives, and total error score, which aids in the comparison of cluster sets.

\[ FP(\gamma) = \frac{1}{2} \sum_{x} \sum_{j=1}^{n_x} y_j \cdot z_j \tag{15} \]

\[ FN(\gamma) = \sum_{x} \sum_{1 \le j < k \le n_x} y_j \cdot y_k \tag{16} \]

\[ \text{Error\_score}(\gamma) = FP(\gamma) + FN(\gamma) \tag{17} \]

In such notation, the cluster sets with their error scores can be listed as follows:

γ = 0.0 (E) ⇒
{1 → {{3,*}, {2,13}, {1,*}, {1,*}, {1,*}, {1,4}, {1,0}, {1,0}, {1,0}},
 2 → {{8,7}, {1,1}},
 3 → {{5,2}, {1,14}},
 4 → {{2,5}, {1,14}, {1,*}},
 5 → {{1,3}},
 6 → {{3,1}, {1,14}},
 7 → {{2,3}},
 8 → {{2,13}, {1,1}, {1,0}},
 9 → {{2,3}}
}
Error_score(0.0) = 97 + 88 = 185

γ = 0.2 ⇒
{1 → {{4,*}, {1,7}, {1,*}, {1,*}, {1,1}, {1,2}, {1,0}, {1,0}, {1,0}},
 2 → {{7,1}, {1,5}, {1,1}},
 3 → {{5,2}, {1,5}},
 4 → {{2,5}, {1,5}, {1,1}},
 5 → {{1,3}},
 6 → {{3,1}, {1,5}},
 7 → {{2,1}},
 8 → {{2,4}, {1,1}, {1,0}},
 9 → {{1,*}, {1,*}}
}
Error_score(0.2) = 38 + 94 = 132

γ = 0.4 ⇒
{1 → {{4,*}, {1,13}, {1,*}, {1,*}, {2,*}, {1,2}, {1,0}, {1,0}},
 2 → {{8,6}, {1,1}},
 3 → {{5,2}, {1,13}},
 4 → {{2,5}, {1,13}, {1,*}},
 5 → {{1,3}},
 6 → {{3,1}, {1,13}},
 7 → {{2,1}},
 8 → {{2,12}, {1,*}, {1,1}},
 9 → {{1,*}, {1,*}}
}
Error_score(0.4) = 78 + 86 = 164

γ = 0.6 ⇒
{1 → {{4,*}, {1,13}, {1,*}, {1,*}, {2,*}, {1,2}, {1,0}, {1,0}},
 2 → {{8,6}, {1,1}},
 3 → {{5,2}, {1,13}},
 4 → {{2,5}, {1,13}, {1,*}},
 5 → {{1,0}},
 6 → {{3,*}, {1,13}},
 7 → {{2,1}},
 8 → {{2,12}, {1,1}, {1,0}},
 9 → {{1,*}, {1,*}}
}
Error_score(0.6) = 75 + 86 = 161

γ = 0.8 ⇒
{1 → {{4,*}, {1,13}, {1,*}, {1,*}, {1,*}, {2,*}, {1,2}, {1,0}},
 2 → {{8,6}, {1,1}},
 3 → {{5,2}, {1,13}},
 4 → {{2,5}, {1,13}, {1,*}},
 5 → {{1,0}},
 6 → {{3,*}, {1,13}},
 7 → {{2,1}},
 8 → {{2,12}, {1,1}, {1,*}},
 9 → {{1,*}, {1,*}}
}
Error_score(0.8) = 75 + 86 = 161

γ = 0.91 (S) ⇒
{1 → {{4,*}, {1,13}, {1,*}, {1,*}, {1,*}, {2,*}, {1,2}, {1,0}},
 2 → {{8,6}, {1,1}},
 3 → {{5,2}, {1,13}},
 4 → {{2,5}, {1,13}, {1,*}},
 5 → {{1,0}},
 6 → {{3,*}, {1,13}},
 7 → {{2,1}},
 8 → {{2,12}, {1,1}, {1,0}},
 9 → {{1,*}, {1,*}}
}
Error_score(0.91) = 75 + 86 = 161

γ = 1.0 (P) ⇒
{1 → {{4,*}, {1,13}, {1,*}, {1,*}, {1,*}, {2,*}, {1,2}, {1,0}},
 2 → {{8,6}, {1,1}},
 3 → {{5,2}, {1,13}},
 4 → {{2,5}, {1,13}, {1,*}},
 5 → {{1,0}},
 6 → {{3,*}, {1,13}},
 7 → {{2,1}},
 8 → {{2,12}, {1,1}, {1,0}},
 9 → {{1,*}, {1,*}}
}
Error_score(1.0) = 75 + 86 = 161

In this notation, γ values of 0.8, 0.91, and 1.0 provide substantially identical cluster groupings, and the likely best error score may be attained at γ = 0.2.
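The scoring functions of equations (15) through (17) may be sketched in Python as follows. The data structure (a dictionary mapping each group number to a list of (y, z) pairs) and the function names are assumptions made for illustration; the example data is the γ = 0.4 cluster set listed above, with "*" treated as 0 as stated in the definition of the notation.

```python
from itertools import combinations
from typing import Dict, List, Tuple, Union

Cluster = Tuple[int, Union[int, str]]  # (y, z); z is an int or "*"

# Cluster set for gamma = 0.4, transcribed from the listing above.
GAMMA_04: Dict[int, List[Cluster]] = {
    1: [(4, "*"), (1, 13), (1, "*"), (1, "*"), (2, "*"), (1, 2), (1, 0), (1, 0)],
    2: [(8, 6), (1, 1)],
    3: [(5, 2), (1, 13)],
    4: [(2, 5), (1, 13), (1, "*")],
    5: [(1, 3)],
    6: [(3, 1), (1, 13)],
    7: [(2, 1)],
    8: [(2, 12), (1, "*"), (1, 1)],
    9: [(1, "*"), (1, "*")],
}


def false_positives(cluster_set: Dict[int, List[Cluster]]) -> float:
    """Equation (15): half the sum of y_j * z_j, with "*" counted as 0."""
    total = sum(
        y * (0 if z == "*" else z)
        for clusters in cluster_set.values()
        for y, z in clusters
    )
    return total / 2


def false_negatives(cluster_set: Dict[int, List[Cluster]]) -> int:
    """Equation (16): sum of y_j * y_k over cluster pairs j < k within a group."""
    return sum(
        a[0] * b[0]
        for clusters in cluster_set.values()
        for a, b in combinations(clusters, 2)
    )


def error_score(cluster_set: Dict[int, List[Cluster]]) -> float:
    """Equation (17): total error score."""
    return false_positives(cluster_set) + false_negatives(cluster_set)
```

The factor of one half in equation (15) reflects that each cross-group pair sharing a cluster is counted once from the viewpoint of each of its two groups.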

To improve the estimated value of γ, the statistical dependence among the experiments may be compensated for by reducing the effective number of experiments by subsampling from the set of all (possibly correlated) experiments. The candidates can be chosen via clustering all the experiments, i.e., columns of the data matrix, and then selecting one representative experiment from each cluster of experiments. The subsampled data may then be clustered, once again using the cut-off correlation value of 0.60. The exemplary resulting cluster sets under the Eisen, Shrinkage, and Pearson metrics are given in Tables 12, 13, and 14, respectively.
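The subsampling step may be sketched as follows, assuming the experiments (columns) have already been assigned cluster labels by some clustering procedure. The function name, the data layout, and in particular the rule of taking the first experiment encountered in each cluster as its representative are assumptions; the description above does not specify how the representative is selected.

```python
from typing import Dict, List, Sequence


def subsample_experiments(
    data: Sequence[Sequence[float]],
    experiment_cluster: Sequence[int],
) -> List[List[float]]:
    """Keep one column (experiment) per cluster of experiments.

    data[j][i] is the value of gene j in experiment i, and
    experiment_cluster[i] is the cluster label of experiment i.
    """
    representative: Dict[int, int] = {}
    for i, label in enumerate(experiment_cluster):
        # Assumed rule: the first experiment seen in a cluster represents it.
        representative.setdefault(label, i)
    keep = sorted(representative.values())
    return [[row[i] for i in keep] for row in data]
```

The resulting matrix has one column per cluster of experiments, so the effective number of experiments N used in estimating γ is reduced accordingly.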



Table 12: RN Subsampled Data, γ = 0.0 (Eisen)



ESS 


SwU/SvnB 




ESS 


Swi4/SwiS 


C!n2, Msb2> Rsrl, BudO, Mrml, 




Swlfi/Mbpl 


Rtirl, Had27, Cdc2L Dunl, 
KndSl, Gdc45 t Mcm2 




ilV» l-l J Dili 1U 


"W tihl Wfrh 1 ? t-Tfrnl HrnO H1\-i1 
nwui, rnjDx., iiLAA, ritkJi, rnui 




Fk\a 


HhEl.tthtLArp* 




Fkld. 


Ternl 




Nddl/Fkh2/Mcrai 


Clb2, Ace2, SwiS 




Acc2/Swi5 
Mcral 


I3gt2 

Mcm3, ManC, CdcG 


B2D 


SwH/SwiB 


Gicl 


EGi 


Svi4/S\viG 


Gtc2 j 


E33 


Swi4/SwiG 

Swlfi/Mbpl 

SwW/Swifi 


Kit»C, Cwpl I 

ClbS ? Clbfi 

Hta3 

GUc20 

CdedG 




Fkiil. 


•B2L2 


mx 


Ace2/Swi5 


GdHi 




Mcml 


Sto2 


E66 


tvferal 


Far1

Table 13: RN Subsampled Data, γ = 0.66 (Shrinkage)



Sd9 


Swi4/SiviG 


Cltil, BiidG, Odil 




Acc2/Swi5 


Egfc2 




Mcral 


Cdcfi 


S6 


Sm«l/SwiG 

■ 


Cln2, Gic2 f Msb2, Rsrl, Mnnl, 
Kxgl 




Swl6/Mbpl 


Hurt, Ifcid27, Cdn21, Diml, 
IlndSl, CdcdS 


S32 


Swi4/5raG 


Gicl 


5GS 


Swifi/Mbpl 


ICrcD, Cwpi 
Glb5. Clhfi 
Tel2 




Ndai/Fkh2/Mcml 
ft fern 1 


Gdc20 
Gdcdfi 


SIS 


Swi6/Mbpl 
Maui 


Mcm2 
Man3 


sn 


SwM/r5\viG 


HilM, IItb2 5 Htal r Htn2, Hlwl 




Fkhl 


HhFi, Hhrl 


SGO 


S\vi<i/S«riG 




S30 


Mil 


ATpr 




Nddl/Fkli2/\Icinl 


Clb2 r Acn2 ? Swi5 


SG2 


Fklil ! 


'iTGIlll 


SS! 


Aci>2/S\vi5 




314 


Mcral 


McrnG 


S35 


Maul 


SUs2 • 




Maul 


Far1

Table 14: RN Subsampled Data, γ = 1.0 (Pearson)



PI 


BwW/SWB 


CJnl, Ochl 


Pis 


Swii/Swifi 


Chtf. Hsrl, iitonl 




SwiG/Mbpl 


Cdc2L> DuuX, Rad5 1, CdeOS, Man2 




Mcml 


Mcm3 


P29 


Swii/Swi6 


Ginl 


P2 


Swii/SwiG 


Gto2 


P3 


Swid/Swifi 


Msb2, Exgl 




Swifi/Mbpl 


Rnrl 


PS I 


BwU/SviS 


BwB 




Nfldl/Pkha/Mcml 


CM, Aoa2, 3wI5 




Aee2/Svri5 


Egfc2 




Mcml 


Cdtfi 


PU 


Swi4/SwiG 


ICreS 


PG2 


SvM/SwU 


Cwpl 




SwlG/Mbpi 


CJbS, 011)6 




SwW/Svvifi 


Htsi3 




N r tldl/Pkh2/fr!nnil 


Cdc20 




Mcml 


Cdc46 


pjg 


SwiG/Mbpl 


Had2t 




SwH/SwlB 


Htbl, Htb2, Mai, 111*2, Hltol 




PkM 


Hhil, HIM 


P10 


FkM 


TO 2 




Mcml 


McraG 


P23 


FkJil 


Arp? 


F^Q 


Pkhl 


r Jfctut 


PG9 






P42 


Mcml 


Ste2 









The subsampled data may yield the lower estimated value γ ≈ 0.66. In the exemplary set notation, the resulting clusters with the corresponding error scores can be written as follows:






γ = 0.0 (E) ⇒
{1 → {{6,23}, {2,*}, {2,5}, {1,*}, {1,*}},
 2 → {{7,22}, {2,5}},
 3 → {{5,24}, {1,6}},
 4 → {{3,26}, {1,*}},
 5 → {{1,28}},
 6 → {{3,26}, {1,6}},
 7 → {{1,*}, {1,28}},
 8 → {{3,26}, {1,6}},
 9 → {{1,*}, {1,*}}
}
Error_score(0.0) = 370 + 79 = 449

γ = 0.66 (S) ⇒
{1 → {{6,6}, {3,2}, {2,5}, {1,*}},
 2 → {{6,6}, {2,5}, {1,1}},
 3 → {{5,2}, {1,*}},
 4 → {{2,5}, {1,3}, {1,6}},
 5 → {{1,*}},
 6 → {{3,1}, {1,6}},
 7 → {{1,4}, {1,*}},
 8 → {{1,4}, {1,6}, {1,1}, {1,*}},
 9 → {{1,*}, {1,*}}
}
Error_score(0.66) = 76 + 88 = 164

γ = 1.0 (P) ⇒
{1 → {{3,6}, {2,*}, {2,1}, {1,*}, {1,*}, {1,*}, {1,5}, {1,5}},
 2 → {{5,4}, {2,4}, {1,2}, {1,7}},
 3 → {{5,3}, {1,5}},
 4 → {{2,6}, {1,*}, {1,1}},
 5 → {{1,*}},
 6 → {{3,3}, {1,5}},
 7 → {{1,*}, {1,5}},
 8 → {{1,1}, {1,5}, {1,5}, {1,8}},
 9 → {{1,*}, {1,*}}
}
Error_score(1.0) = 69 + 107 = 176



From the tables for the range-normalized, subsampled yeast data, as 
well as by comparing the error scores, it appears that for the same clustering 
algorithm and threshold value, Pearson introduces more false negatives and Eisen 
introduces more false positives than Shrinkage. The exemplary Shrinkage procedure 
according to the present invention may reduce these errors by combining the positive 
properties of both algorithms. This observation is consistent with the mathematical 
analysis and simulation described above. 



GENERAL DISCUSSION

Microarray-based genomic analysis and other similar high-throughput methods have begun to occupy an increasingly important role in biology, as they have helped to create a visual image of the state-space trajectories at the core of the cellular processes. Nevertheless, as described above, a small error in the estimation of a parameter (e.g., the shrinkage parameter) may have a significant effect on the overall conclusion. Errors in the estimators can manifest themselves by missing certain biological relations between two genes (false negatives) or by proposing phantom relations between two otherwise unrelated genes (false positives).

A global illustration of these interactions can be seen in an exemplary Receiver Operator Characteristic ("ROC") graph (shown in Figure 6) with each curve parameterized by the cut-off threshold in the range of [-1,1]. The ROC curve (see, e.g., Egan, J.P., Signal Detection Theory and ROC Analysis, Academic Press, New York (1975), the entire disclosure of which is incorporated herein by reference in its entirety) for a given metric preferably plots sensitivity against (1 - specificity), where:

\[ \text{Sensitivity} = \text{fraction of positives detected by a metric} = \frac{TP(\gamma)}{TP(\gamma) + FN(\gamma)}, \]

\[ \text{Specificity} = \text{fraction of negatives detected by a metric} = \frac{TN(\gamma)}{TN(\gamma) + FP(\gamma)}, \]

and TP(γ), FN(γ), FP(γ) and TN(γ) denote the number of True Positives, False Negatives, False Positives, and True Negatives, respectively, arising from a metric associated with a given γ. (Recall that γ is 0.0 for Eisen, 1.0 for Pearson, and may be computed according to equation (14) for Shrinkage, which yields about 0.66 on this data set.) For each pair of genes, {j, k}, we can define these events using our hypothesis as a measure of truth:

TP: {j, k} can be in same group (see Table 1) and {j, k} can be placed in same cluster;

FP: {j, k} can be in different groups, but {j, k} can be placed in same cluster;

TN: {j, k} can be in different groups and {j, k} can be placed in different clusters; and

FN: {j, k} can be in same group, but {j, k} can be placed in different clusters.

FP(γ) and FN(γ) were already defined in equations (15) and (16), respectively, and we define

\[ TP(\gamma) = \sum_{x} \sum_{j=1}^{n_x} \binom{y_j}{2} \tag{18} \]

and

\[ TN(\gamma) = \text{Total} - \left(TP(\gamma) + FN(\gamma) + FP(\gamma)\right) \tag{19} \]

where Total = \binom{44}{2} = 946 is the total number of gene pairs {j, k} in Table 1.

The ROC figure suggests the best threshold to use for each metric, and can also be used to select the best metric to use for a particular sensitivity.

The dependence of the error scores on the threshold can be more clearly seen from an exemplary graph of Figure 7, which shows that a threshold value of about 0.60 is a reasonable representative value.
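One point of such an ROC curve may be computed from a cluster set as in the following Python sketch. The function names and data layout are assumptions made for illustration. Equation (18) is read here as counting the pairs of genes that share both a group and a cluster, and the example data is the cluster set for range-normalized data at γ = 0.4 and a threshold of 0.60, with "*" treated as 0.

```python
from itertools import combinations
from math import comb

# Total number of gene pairs among the 44 genes of Table 1.
TOTAL_PAIRS = comb(44, 2)

# Cluster set for gamma = 0.4 (range-normalized data), as (y, z) pairs per group.
GAMMA_04 = {
    1: [(4, "*"), (1, 13), (1, "*"), (1, "*"), (2, "*"), (1, 2), (1, 0), (1, 0)],
    2: [(8, 6), (1, 1)],
    3: [(5, 2), (1, 13)],
    4: [(2, 5), (1, 13), (1, "*")],
    5: [(1, 3)],
    6: [(3, 1), (1, 13)],
    7: [(2, 1)],
    8: [(2, 12), (1, "*"), (1, 1)],
    9: [(1, "*"), (1, "*")],
}


def confusion_counts(cluster_set):
    """Return (TP, FN, FP, TN) for one cluster set."""
    groups = list(cluster_set.values())
    tp = sum(comb(y, 2) for cl in groups for y, _ in cl)               # eq. (18)
    fn = sum(a[0] * b[0] for cl in groups for a, b in combinations(cl, 2))  # eq. (16)
    # Eq. (15); the doubled sum is even when the cluster set is complete.
    fp = sum(y * (0 if z == "*" else z) for cl in groups for y, z in cl) // 2
    tn = TOTAL_PAIRS - (tp + fn + fp)                                  # eq. (19)
    return tp, fn, fp, tn


def roc_point(cluster_set):
    """Return (1 - specificity, sensitivity) for one cluster set."""
    tp, fn, fp, tn = confusion_counts(cluster_set)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 1 - specificity, sensitivity
```

Repeating this computation for the cluster sets obtained at each threshold in [-1, 1] traces out the ROC curve for a given metric.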

B. FINANCIAL EXAMPLE 

The algorithms of the present invention may also be applied to
financial markets. For example, the algorithm may be applied to determine the
behavior of individual stocks or groups of stocks offered for sale on one or more
publicly-traded stock markets relative to other individual stocks, groups of stocks,
stock market indices calculated from the values of one or more individual stocks (e.g.,
the Dow Jones Industrial Average or the S&P 500), or stock markets as a whole. Thus,
an individual considering investment in a given stock or group of stocks in order to
achieve a return on their investment greater than that provided by another stock,
another group of stocks, a stock index or the market as a whole, could employ the
algorithm of the present invention to determine whether the sales price of the given
stock or group of stocks under consideration moves in a correlated way to the
movement of any other stock, groups of stocks, stock indices or stock markets as a
whole. If there is a strong association between the movement of the price of a given
stock or group of stocks and another stock, another group of stocks, a stock index or
the market as a whole, the prospective investor may not wish to assume the potentially
greater risk associated with investing in a single stock when its likelihood of
increasing in value may be limited by the movement of the market as a whole, which is
usually a less risky investment. Alternatively, an investor who knows or believes that
a given stock has in the past outperformed other stocks, a stock market index, or the
market as a whole, could employ the algorithm of the present invention to identify
other promising stocks that are likely to behave similarly as future candidates for
investment. Those skilled in the art of investment will recognize that the present
invention may be applied in numerous systems, methods, and software arrangements for
identifying candidate investments, not only in stock markets, but also in other
markets including but not limited to the bond market, futures markets, commodities
markets, etc., and the present invention is in no way limited to the exemplary
applications and embodiments described herein.

The foregoing merely illustrates the principles of the present invention.
Various modifications and alterations to the described embodiments will be apparent
to those skilled in the art in view of the teachings herein. It will thus be appreciated
that those skilled in the art will be able to devise numerous systems, methods, and
software arrangements for determining associations between one or more elements
contained within two or more datasets that, although not explicitly shown or described
herein, embody the principles of the invention and are thus within the spirit and scope
of the invention. Indeed, the present invention is in no way limited to the exemplary
applications and embodiments thereof described above.

APPENDIX

APPENDIX A.1 - RECEIVER OPERATOR CHARACTERISTIC CURVES 

Definitions 

If two genes are in the same group, they may "belong in same cluster", 
and if they are in different groups, they may "belong in different clusters." Receiver 
Operator Characteristic (ROC) curves, a graphical representation of the number of 
true positives versus the number of false positives for a binary classification system as 
the discrimination threshold is varied, are generated for each metric used (i.e., one for 
Eisen, one for Pearson, and one for Shrinkage). 
Event: grouping of (cell cycle) genes into clusters; 

Threshold: cut-off similarity value at which the hierarchy tree is cut into clusters. 
The exemplary cell-cycle gene table can consist of 44 genes, which gives us
$\binom{44}{2} = 946$ gene pairs. For each (unordered) gene pair {j, k}, define the following events:
TP: {j, k} can be in same group and {j, k} can be placed in same cluster;
FP: {j, k} can be in different groups, but {j, k} can be placed in same cluster;
TN: {j, k} can be in different groups and {j, k} can be placed in different clusters; and
FN: {j, k} can be in same group, but {j, k} can be placed in different clusters.
Thus,

$$FP(\gamma) = \sum_{\{j,k\}} FP(\{j,k\})$$

$$TP(\gamma) = \sum_{\{j,k\}} TP(\{j,k\})$$

$$FN(\gamma) = \sum_{\{j,k\}} FN(\{j,k\})$$

where the sums are taken over all 946 unordered pairs of genes.
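
The four events can be tallied directly from two label lists. The following is a minimal sketch, assuming hypothetical inputs `groups` and `clusters` that give, for each gene, its group from Table 1 and the cluster it was placed in; the function name `pair_counts` and the toy labels are illustrative and do not come from the source.

```python
from itertools import combinations
from math import comb

def pair_counts(groups, clusters):
    """Count TP, FN, FP, TN over all unordered gene pairs {j, k}.

    groups[i]   -- group label of gene i (hypothetical stand-in for Table 1)
    clusters[i] -- cluster label assigned to gene i by the clustering
    """
    tp = fn = fp = tn = 0
    for j, k in combinations(range(len(groups)), 2):
        same_group = groups[j] == groups[k]
        same_cluster = clusters[j] == clusters[k]
        if same_group and same_cluster:
            tp += 1          # belong together, placed together
        elif same_group:
            fn += 1          # belong together, placed apart
        elif same_cluster:
            fp += 1          # belong apart, placed together
        else:
            tn += 1          # belong apart, placed apart
    return tp, fn, fp, tn

# Toy data: four genes, two groups, two clusters.
toy_groups = [0, 0, 1, 1]
toy_clusters = [0, 0, 0, 1]
tp, fn, fp, tn = pair_counts(toy_groups, toy_clusters)

# Every pair falls into exactly one event, so the counts sum to C(n, 2).
assert tp + fn + fp + tn == comb(len(toy_groups), 2)
```

With 44 genes the same loop visits `comb(44, 2)` pairs, which is the Total used in the text.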
Two other quantities involved in ROC curve generation can be

$$\text{Sensitivity} = \text{fraction of positives detected by a metric} = \frac{TP(\gamma)}{TP(\gamma) + FN(\gamma)}, \tag{20}$$

$$\text{Specificity} = \text{fraction of negatives detected by a metric} = \frac{TN(\gamma)}{TN(\gamma) + FP(\gamma)}. \tag{21}$$

The ROC curve plots sensitivity, on the y-axis, as a function of (1 - specificity), on the
x-axis, with each point on the plot corresponding to a different cut-off value. A
different curve was created for each of the three metrics.
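
As a small illustration of how one point of such a curve is obtained, the sketch below turns the four counts for a single cut-off value into an (x, y) pair. The function names are hypothetical, and the sketch assumes that both denominators are nonzero.

```python
def sensitivity(tp, fn):
    """Fraction of positives detected: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of negatives detected: TN / (TN + FP)."""
    return tn / (tn + fp)

def roc_point(tp, fn, fp, tn):
    """Return (1 - specificity, sensitivity) for one cut-off value."""
    return 1.0 - specificity(tn, fp), sensitivity(tp, fn)

# Counts for one hypothetical cut-off value.
x, y = roc_point(tp=1, fn=1, fp=2, tn=2)
```

Repeating this for a range of cut-off values, and for each metric, gives the points of one curve per metric.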



The following sections describe how the quantities TP(γ), FN(γ),
FP(γ), and TN(γ) can be computed using an exemplary set notation for clusters, with a
relationship of:

$$\bigl\{\, x \to \{\{y_1, z_1\}, \ldots, \{y_{n_x}, z_{n_x}\}\} \,\bigr\}_{x=1}^{\#\text{ of groups}}$$

Computations

A. TP

$$TP(\gamma) = \sum_{\{j,k\}} TP(\{j,k\}) =$$

# gene pairs that were placed in same cluster and belong in same group.

For each group x given in set notation as $x \to \{\{y_1, z_1\}, \ldots, \{y_{n_x}, z_{n_x}\}\}$,
pairs from each $y_j$ should be counted, i.e.,

$$TP(x) = \binom{y_1}{2} + \cdots + \binom{y_{n_x}}{2} = \sum_{j=1}^{n_x} \binom{y_j}{2}.$$

Obtaining a total over all groups yields

$$TP(\gamma) = \sum_{x=1}^{\#\text{groups}} TP(x) = \sum_{x=1}^{\#\text{groups}} \sum_{j=1}^{n_x} \binom{y_j}{2}.$$

B. FN

$$FN(\gamma) = \sum_{\{j,k\}} FN(\{j,k\}) =$$

# gene pairs that belong in same group but were placed into different clusters.

Every pair that was separated could be counted:

$$FN(x) = \begin{cases} \displaystyle\sum_{j=1}^{n_x - 1} \sum_{k=j+1}^{n_x} y_j\, y_k & \text{if } n_x \ge 2, \\ 0 & \text{if } n_x = 1. \end{cases}$$

However, when $n_x = 1$, there is no pair {j, k} that satisfies the triple inequality
$1 \le j < k \le n_x$, and hence, it is not necessary to treat such a pair as a special case:

$$FN(\gamma) = \sum_{x=1}^{\#\text{groups}} \; \sum_{1 \le j < k \le n_x} y_j\, y_k.$$

C. FP

$$FP(\gamma) = \sum_{\{j,k\}} FP(\{j,k\}) =$$

# gene pairs that belong in different groups but got placed in the same cluster.

The expression

$$\sum_{x=1}^{\#\text{groups}} \sum_{j=1}^{n_x} y_j\, z_j$$

may count every false-positive pair {j, k} twice: first, when looking at j's group, and
again, when looking at k's group. Hence

$$FP(\gamma) = \frac{1}{2} \sum_{x=1}^{\#\text{groups}} \sum_{j=1}^{n_x} y_j\, z_j.$$

D. TN

$$TN(\gamma) = \sum_{\{j,k\}} TN(\{j,k\}) =$$

# gene pairs that belong in different groups and got placed in different clusters.

Instead of counting true-negatives from our notation, the fact that the other three
scores are known may be used, and the total thereof can also be utilized.
Complementarity. Given a gene pair {j, k}, only one of the events {TP({j, k}),
FN({j, k}), FP({j, k}), TN({j, k})} may be true. This implies

$$\sum_{\{j,k\}} TP(\{j,k\}) + \sum_{\{j,k\}} FN(\{j,k\}) + \sum_{\{j,k\}} FP(\{j,k\}) + \sum_{\{j,k\}} TN(\{j,k\})$$

$$= TP(\gamma) + FN(\gamma) + FP(\gamma) + TN(\gamma) = \binom{44}{2} = \frac{44 \cdot 43}{2} = 946 = \mathrm{Total}$$

$$\therefore\; TN(\gamma) = \mathrm{Total} - \bigl(TP(\gamma) + FN(\gamma) + FP(\gamma)\bigr)$$

Plotting ROC curves 

For each cut-off value θ, TP(γ), FN(γ), FP(γ), and TN(γ) are computed
as described above, with γ ∈ {0.0, 0.66, 1.0} corresponding to Eisen, Shrinkage, and
Pearson, respectively. Then, the sensitivity and specificity may be computed from
equations (20) and (21), and sensitivity vs. (1 - specificity) can be plotted, as shown in
Figure 6.
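
The counts used for each cut-off value can also be obtained from per-group tallies rather than by enumerating pairs. The sketch below rests on an assumed reading of the set notation: each group x is described by a list of (y_j, z_j) pairs, where y_j is the number of genes from group x in the j-th cluster in which the group appears and z_j is the number of genes from other groups in that same cluster. The dictionary layout, function name, and toy numbers are hypothetical.

```python
from math import comb

def counts_from_set_notation(cluster_sets, total_pairs):
    """Compute TP, FN, FP, TN from per-group (y_j, z_j) tallies.

    cluster_sets -- {group: [(y_1, z_1), ..., (y_nx, z_nx)]} (assumed layout)
    total_pairs  -- total number of unordered gene pairs
    """
    tp = fn = twice_fp = 0
    for pairs in cluster_sets.values():
        ys = [y for y, _ in pairs]
        # Pairs from the same group that ended up in the same cluster.
        tp += sum(comb(y, 2) for y in ys)
        # Pairs from the same group split across two of its clusters.
        fn += sum(ys[j] * ys[k]
                  for j in range(len(ys)) for k in range(j + 1, len(ys)))
        # Each cross-group pair in a cluster is seen once from each group.
        twice_fp += sum(y * z for y, z in pairs)
    fp = twice_fp // 2
    tn = total_pairs - (tp + fn + fp)   # complementarity
    return tp, fn, fp, tn

# Toy example: group 0 sits in one cluster with one outsider;
# group 1 is split over that cluster and a second cluster of its own.
toy_sets = {0: [(2, 1)], 1: [(1, 2), (1, 0)]}
toy_counts = counts_from_set_notation(toy_sets, total_pairs=comb(4, 2))
```

The halving step mirrors the double-counting remark for false positives, and the last line of the function is the complementarity identity.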

The effect of the cut-off threshold θ on the FN and FP scores
individually also can be examined, using an exemplary graph shown in Figure 7.

A 3-dimensional graph of (1 - specificity) on the x-axis, sensitivity on
the y-axis, and threshold on the z-axis offers a view shown in Figure 8.
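A.2 Computing the marginal pdf for X

$$f(X) = \int_{-\infty}^{\infty} f(X \mid \theta)\, \pi(\theta)\, d\theta = \int_{-\infty}^{\infty} \frac{1}{2\pi\sigma\tau} \exp\left[ -\frac{1}{2} \left( \frac{(X-\theta)^2}{\sigma^2} + \frac{\theta^2}{\tau^2} \right) \right] d\theta \tag{22}$$

First, rewrite the exponent as a complete square:

$$\frac{(X-\theta)^2}{\sigma^2} + \frac{\theta^2}{\tau^2} = \frac{\sigma^2+\tau^2}{\sigma^2\tau^2} \left[ \theta^2 - 2\theta\, \frac{\tau^2 X}{\sigma^2+\tau^2} + \frac{\tau^2 X^2}{\sigma^2+\tau^2} \right] \tag{23}$$

$$\theta^2 - 2\theta\, \frac{\tau^2 X}{\sigma^2+\tau^2} + \frac{\tau^2 X^2}{\sigma^2+\tau^2} = \left( \theta - \frac{\tau^2 X}{\sigma^2+\tau^2} \right)^2 + \frac{\sigma^2\tau^2 X^2}{(\sigma^2+\tau^2)^2} \tag{24}$$

Substituting (24) into (23) yields

$$\frac{(X-\theta)^2}{\sigma^2} + \frac{\theta^2}{\tau^2} = \frac{\sigma^2+\tau^2}{\sigma^2\tau^2} \left( \theta - \frac{\tau^2 X}{\sigma^2+\tau^2} \right)^2 + \frac{X^2}{\sigma^2+\tau^2} \tag{25}$$

Now use the completed square in (25) to continue the computation in (22):

$$f(X) = \frac{1}{2\pi\sigma\tau} \exp\left[ -\frac{X^2}{2(\sigma^2+\tau^2)} \right] \int_{-\infty}^{\infty} \exp\left[ -\frac{\sigma^2+\tau^2}{2\sigma^2\tau^2} \left( \theta - \frac{\tau^2 X}{\sigma^2+\tau^2} \right)^2 \right] d\theta$$

Let

$$u = \frac{\sqrt{\sigma^2+\tau^2}}{\sigma\tau} \left( \theta - \frac{\tau^2 X}{\sigma^2+\tau^2} \right), \qquad d\theta = \frac{\sigma\tau}{\sqrt{\sigma^2+\tau^2}}\, du.$$

Then $\theta = \pm\infty \Rightarrow u = \pm\infty$, and

$$f(X) = \frac{1}{2\pi\sigma\tau} \cdot \frac{\sigma\tau}{\sqrt{\sigma^2+\tau^2}} \exp\left[ -\frac{X^2}{2(\sigma^2+\tau^2)} \right] \int_{-\infty}^{\infty} e^{-u^2/2}\, du = \frac{1}{\sqrt{2\pi}\,\sqrt{\sigma^2+\tau^2}} \exp\left[ -\frac{X^2}{2(\sigma^2+\tau^2)} \right] \tag{26}$$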
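
The closed form in (26) can be checked numerically. The sketch below assumes a likelihood X | θ ~ N(θ, σ²) and a zero-mean prior θ ~ N(0, τ²), integrates their product over θ with a simple midpoint rule, and compares the result with the N(0, σ² + τ²) density. The helper names and the sample values are illustrative only.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def marginal_numeric(x, sigma2, tau2, steps=20000):
    """Midpoint-rule approximation of the integral of f(x | theta) * pi(theta)."""
    half_width = 12 * math.sqrt(tau2)   # prior mass beyond this is negligible
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        theta = -half_width + (i + 0.5) * h
        total += normal_pdf(x, theta, sigma2) * normal_pdf(theta, 0.0, tau2)
    return total * h

x, sigma2, tau2 = 0.7, 1.5, 2.0
numeric = marginal_numeric(x, sigma2, tau2)
closed_form = normal_pdf(x, 0.0, sigma2 + tau2)   # right-hand side of (26)
```

The two values agree to within the accuracy of the numerical integration.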
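A.3 Calculation of the posterior distribution of θ_j

Since the subscript j remains constant throughout the calculation, it
will be dropped in this appendix. Herein, θ_j will be replaced by θ, and
X_j by X.

$$\pi(\theta \mid X) = \frac{f(X \mid \theta)\, \pi(\theta)}{f(X)} = \frac{f(X, \theta)}{f(X)} = \frac{\dfrac{1}{2\pi\sigma\tau} \exp\left[ -\dfrac{1}{2} \left( \dfrac{(X-\theta)^2}{\sigma^2} + \dfrac{\theta^2}{\tau^2} \right) \right]}{\dfrac{1}{\sqrt{2\pi}\sqrt{\sigma^2+\tau^2}} \exp\left[ -\dfrac{X^2}{2(\sigma^2+\tau^2)} \right]}$$

$$= \frac{\sqrt{\sigma^2+\tau^2}}{\sqrt{2\pi}\,\sigma\tau} \exp\left\{ -\frac{1}{2} \left[ \frac{(X-\theta)^2}{\sigma^2} + \frac{\theta^2}{\tau^2} - \frac{X^2}{\sigma^2+\tau^2} \right] \right\}$$

The bracketed term in the exponent simplifies as

$$\frac{(X-\theta)^2}{\sigma^2} + \frac{\theta^2}{\tau^2} - \frac{X^2}{\sigma^2+\tau^2} = \frac{1}{\sigma^2\tau^2(\sigma^2+\tau^2)} \left[ (\sigma^2+\tau^2)^2 \theta^2 - 2(\sigma^2+\tau^2)\theta \cdot \tau^2 X + X^2 \bigl( \tau^2(\sigma^2+\tau^2) - \sigma^2\tau^2 \bigr) \right]$$

$$= \frac{1}{\sigma^2\tau^2(\sigma^2+\tau^2)} \left[ (\sigma^2+\tau^2)^2 \theta^2 - 2(\sigma^2+\tau^2)\theta \cdot \tau^2 X + \tau^4 X^2 \right] = \frac{\sigma^2+\tau^2}{\sigma^2\tau^2} \left( \theta - \frac{\tau^2 X}{\sigma^2+\tau^2} \right)^2$$

Therefore,

$$\pi(\theta \mid X) = \frac{1}{\sqrt{2\pi\, \dfrac{\sigma^2\tau^2}{\sigma^2+\tau^2}}} \exp\left[ -\frac{\left( \theta - \dfrac{\tau^2 X}{\sigma^2+\tau^2} \right)^2}{2\, \dfrac{\sigma^2\tau^2}{\sigma^2+\tau^2}} \right] \tag{27}$$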
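
Under the same assumptions as in Appendix A.2 (X | θ ~ N(θ, σ²) and θ ~ N(0, τ²)), the posterior is a Normal density whose mean is the observation shrunk toward zero. The sketch below computes the posterior mean and variance and checks them against Bayes' rule at one point; the function names and the numbers are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_params(x, sigma2, tau2):
    """Posterior mean and variance of theta given a single observation x."""
    shrink = tau2 / (sigma2 + tau2)          # weight placed on the data
    return shrink * x, sigma2 * tau2 / (sigma2 + tau2)

x, sigma2, tau2 = 2.0, 1.0, 3.0
post_mean, post_var = posterior_params(x, sigma2, tau2)

# Bayes' rule at an arbitrary theta: f(x | theta) * pi(theta) / f(x).
theta = 0.4
bayes = (normal_pdf(x, theta, sigma2) * normal_pdf(theta, 0.0, tau2)
         / normal_pdf(x, 0.0, sigma2 + tau2))
direct = normal_pdf(theta, post_mean, post_var)
```

The factor τ²/(σ² + τ²) is what pulls the observed value toward the prior mean of zero.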
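A.4 Proof of the Fact that n Independent Observations from the Normal
Population N(θ, σ²) Can Be Treated as a Single Observation from N(θ, σ²/n)

Given the data y, f(y|θ) can be viewed as a function of θ. We then call
it the likelihood function of θ for given y, and write

$$l(\theta \mid y) \propto f(y \mid \theta).$$

When y is a single data point from N(θ, σ²),

$$l(\theta \mid y) \propto \exp\left[ -\frac{1}{2} \left( \frac{\theta - x}{\sigma} \right)^2 \right] \tag{28}$$

where x is some function of y.

Now, suppose that $y = (y_1, \ldots, y_n)$ represents a vector of n independent
observations from N(θ, σ²). We can denote the sample mean by

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.$$

The likelihood function of θ given such n independent observations is

$$l(\theta \mid y) \propto \prod_{i=1}^{n} \exp\left[ -\frac{1}{2\sigma^2} (y_i - \theta)^2 \right] = \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta)^2 \right].$$

Also, since

$$\sum_{i=1}^{n} (y_i - \theta)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 + n(\bar{y} - \theta)^2, \tag{29}$$

it follows that

$$l(\theta \mid y) \propto \underbrace{\exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \bar{y})^2 \right]}_{\text{const w.r.t. } \theta} \exp\left[ -\frac{n}{2\sigma^2} (\bar{y} - \theta)^2 \right] \propto \exp\left[ -\frac{1}{2} \left( \frac{\theta - \bar{y}}{\sigma/\sqrt{n}} \right)^2 \right], \tag{30}$$

which is a Normal function with mean $\bar{y}$ and variance σ²/n. Comparing
with (28), we can recognize that this is equivalent to treating the data
y as a single observation $\bar{y}$ with mean θ and variance σ²/n, i.e.,

$$\bar{y} \sim N(\theta, \sigma^2/n).$$

Proof of (29):

$$\sum_{i=1}^{n} (y_i - \theta)^2 = \sum_{i=1}^{n} \bigl( (y_i - \bar{y}) + (\bar{y} - \theta) \bigr)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 + 2(\bar{y} - \theta) \sum_{i=1}^{n} (y_i - \bar{y}) + n(\bar{y} - \theta)^2$$

$$= \sum_{i=1}^{n} (y_i - \bar{y})^2 + n(\bar{y} - \theta)^2, \quad \text{since } \sum_{i=1}^{n} (y_i - \bar{y}) = 0.$$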
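
The decomposition in (29), and the resulting equivalence between n observations and their mean, can be illustrated with a few numbers. In the sketch below the data values, θ values, and helper names are arbitrary choices made for illustration.

```python
def sum_sq_about(values, center):
    """Sum of squared deviations of values from center."""
    return sum((v - center) ** 2 for v in values)

def log_lik_full(values, theta, sigma2):
    """Log-likelihood of theta from all observations, up to an additive constant."""
    return -sum_sq_about(values, theta) / (2 * sigma2)

def log_lik_mean(values, theta, sigma2):
    """Log-likelihood of theta from the sample mean alone, which has variance sigma2 / n."""
    n = len(values)
    ybar = sum(values) / n
    return -(ybar - theta) ** 2 / (2 * sigma2 / n)

y = [1.2, 0.4, 2.9, 1.7]
n, ybar = len(y), sum(y) / len(y)
theta = 0.5

# Identity (29): total spread about theta = spread about ybar + n * (ybar - theta)^2.
lhs = sum_sq_about(y, theta)
rhs = sum_sq_about(y, ybar) + n * (ybar - theta) ** 2

# The two likelihoods differ only by a factor that does not depend on theta,
# so differences in log-likelihood between two theta values coincide.
diff_full = log_lik_full(y, 0.5, 2.0) - log_lik_full(y, 1.5, 2.0)
diff_mean = log_lik_mean(y, 0.5, 2.0) - log_lik_mean(y, 1.5, 2.0)
```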
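A.5 Distribution of the Sum of two Independent Normal Random Variables

Let

$$X \sim N(0, \alpha^2)$$

$$Y \sim N(0, \beta^2)$$

be two independent random variables.
Claim: $X + Y \sim N(0, \alpha^2 + \beta^2)$.

(This result is used for Normal r.v.'s, although a
more general result can be proven.)

Proof: (use moment generating functions)

$$m_X(t) = E\left( e^{tX} \right) = \frac{1}{\sqrt{2\pi}\,\alpha} \int_{-\infty}^{\infty} e^{tx}\, e^{-\frac{x^2}{2\alpha^2}}\, dx = \frac{1}{\sqrt{2\pi}\,\alpha} \int_{-\infty}^{\infty} e^{-\frac{1}{2\alpha^2} \left[ x^2 - 2\alpha^2 t x \right]}\, dx \tag{32}$$

Completing the square, obtain

$$x^2 - 2\alpha^2 t x = (x - \alpha^2 t)^2 - \alpha^4 t^2 \tag{33}$$

Using the result of (33) in (32) yields

$$m_X(t) = \frac{1}{\sqrt{2\pi}\,\alpha}\, e^{\frac{1}{2}\alpha^2 t^2} \int_{-\infty}^{\infty} e^{-\frac{(x - \alpha^2 t)^2}{2\alpha^2}}\, dx$$

Let $y = \dfrac{x - \alpha^2 t}{\alpha}$, so that $dx = \alpha\, dy$.

With this substitution, we obtain

$$m_X(t) = \frac{1}{\sqrt{2\pi}\,\alpha}\, e^{\frac{1}{2}\alpha^2 t^2}\, \alpha \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}}\, dy$$

or

$$m_X(t) = e^{\frac{1}{2}\alpha^2 t^2} \tag{34}$$

Similarly,

$$m_Y(t) = e^{\frac{1}{2}\beta^2 t^2} \tag{35}$$

To obtain the distribution of X + Y, it suffices to compute the
corresponding moment generating function:

$$m_{X+Y}(t) = E\left( e^{t(X+Y)} \right) = E\left( e^{tX} e^{tY} \right)$$

$$= E\left( e^{tX} \right) E\left( e^{tY} \right) \quad \text{by independence of X and Y}$$

$$= m_X(t) \cdot m_Y(t)$$

$$= e^{\frac{1}{2}\alpha^2 t^2} \cdot e^{\frac{1}{2}\beta^2 t^2} \quad \text{by (34) and (35)}$$

$$= e^{\frac{1}{2}(\alpha^2 + \beta^2) t^2}$$

which is a moment generating function of a Normal random variable
with mean 0 and variance α² + β². Therefore,

$$X + Y \sim N(0, \alpha^2 + \beta^2). \tag{36}$$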
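
A quick simulation illustrates the claim that the variances add. This is only a sanity check under the stated independence assumption, not a proof; the sample size, seed, and function name are arbitrary.

```python
import random

def simulate_sum(alpha, beta, n_samples=100_000, seed=0):
    """Draw X ~ N(0, alpha^2) and Y ~ N(0, beta^2) independently; return mean and variance of X + Y."""
    rng = random.Random(seed)
    sums = [rng.gauss(0.0, alpha) + rng.gauss(0.0, beta) for _ in range(n_samples)]
    mean = sum(sums) / n_samples
    var = sum((s - mean) ** 2 for s in sums) / n_samples
    return mean, var

alpha, beta = 1.5, 2.0
sample_mean, sample_var = simulate_sum(alpha, beta)
expected_var = alpha ** 2 + beta ** 2   # variance stated in (36)
```

The sample mean should be close to 0 and the sample variance close to `expected_var`, up to Monte Carlo noise.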






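A.6 Properties of the Chi-square Distribution

Let $X_1, X_2, \ldots, X_k$ be i.i.d. r.v.'s from the standard Normal distribution:

$$X_j \sim N(0, 1) \quad \forall j.$$

Then

$$\chi^2_k = X_1^2 + X_2^2 + \cdots + X_k^2 = \sum_{j=1}^{k} X_j^2$$

is a random variable from the Chi-square distribution with k degrees of
freedom, denoted

$$\chi^2_k \sim \chi^2(k).$$

It has the probability density function

$$f(x) = \begin{cases} \dfrac{1}{2^{k/2}\, \Gamma(k/2)}\, x^{k/2 - 1} e^{-x/2} & \text{for } x > 0, \\ 0 & \text{otherwise,} \end{cases}$$

where

$$\Gamma(k) = \int_0^{\infty} t^{k-1} e^{-t}\, dt. \tag{37}$$

The result we are using is

$$E\left( \frac{1}{\chi^2_k} \right) = \frac{1}{k-2} \quad \text{for } k > 2,$$

which can be obtained as follows:

$$E\left( \frac{1}{\chi^2_k} \right) = \int_0^{\infty} \frac{1}{x}\, f(x)\, dx = \frac{1}{2^{k/2}\, \Gamma(k/2)} \int_0^{\infty} x^{k/2 - 2} e^{-x/2}\, dx \tag{38}$$

Let $t = x/2$, so that $x = 2t$ and $dx = 2\,dt$, with $x = 0 \Rightarrow t = 0$ and
$x = \infty \Rightarrow t = \infty$. Then

$$\int_0^{\infty} x^{k/2 - 2} e^{-x/2}\, dx = \int_0^{\infty} (2t)^{k/2 - 2} e^{-t}\, 2\, dt = 2^{k/2 - 1} \int_0^{\infty} t^{k/2 - 2} e^{-t}\, dt \tag{39}$$

Let

$$u = e^{-t}, \qquad dv = t^{k/2 - 2}\, dt, \qquad v = \frac{t^{k/2 - 1}}{k/2 - 1}.$$

Integration by parts transforms (39) into

$$2^{k/2 - 1} \left[ \left. \frac{t^{k/2 - 1} e^{-t}}{k/2 - 1} \right|_0^{\infty} + \frac{1}{k/2 - 1} \int_0^{\infty} t^{k/2 - 1} e^{-t}\, dt \right] = \frac{2^{k/2 - 1}}{k/2 - 1}\, \Gamma(k/2) \quad \text{for } k > 2, \text{ by (37)}.$$

Substituting this result in (38) yields

$$E\left( \frac{1}{\chi^2_k} \right) = \frac{1}{2^{k/2}\, \Gamma(k/2)} \cdot \frac{2^{k/2 - 1}\, \Gamma(k/2)}{k/2 - 1} = \frac{1}{2\,(k/2 - 1)} = \frac{1}{k-2} \quad \text{for } k > 2. \tag{40}$$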
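
Result (40) can be checked numerically by integrating (1/x) f(x) directly. The sketch below uses a plain midpoint rule with an arbitrary truncation point and step count; the function names are hypothetical, and `math.gamma` is the standard-library Gamma function.

```python
import math

def chi2_pdf(x, k):
    """Chi-square density with k degrees of freedom, for x > 0."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

def expected_inverse_chi2(k, upper=200.0, steps=200_000):
    """Midpoint-rule approximation of E(1 / chi2_k) on (0, upper]."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += chi2_pdf(x, k) / x
    return total * h

approx_k6 = expected_inverse_chi2(6)    # (40) gives 1 / (6 - 2)
approx_k10 = expected_inverse_chi2(10)  # (40) gives 1 / (10 - 2)
```

For k ≤ 2 the integrand is not integrable near zero, which matches the restriction k > 2 in (40).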
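A.7 Distribution of Sample Variance s²

Let $X_j \sim N(\mu, \sigma^2)$ for j = 1, ..., n be independent r.v.'s. We'll derive
the joint distribution of

$$\bar{X} = \frac{1}{n} \sum_{j=1}^{n} X_j$$

and

$$s^2 = \frac{1}{n-1} \sum_{j=1}^{n} (X_j - \bar{X})^2.$$

W.L.O.G. can reduce the problem to the case N(0, 1), i.e., μ = 0,
σ² = 1: Let $Z_j = (X_j - \mu)/\sigma$. Then

$$\bar{Z} = \frac{1}{n} \sum_{j=1}^{n} Z_j = \frac{1}{n} \sum_{j=1}^{n} \frac{X_j - \mu}{\sigma} = \frac{\bar{X} - \mu}{\sigma}$$

and hence

$$\bar{X} = \sigma \bar{Z} + \mu. \tag{41}$$

Also,

$$\frac{(n-1)\, s^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{j=1}^{n} (X_j - \bar{X})^2 = \sum_{j=1}^{n} \left( \frac{X_j - \mu}{\sigma} - \frac{\bar{X} - \mu}{\sigma} \right)^2 = \sum_{j=1}^{n} (Z_j - \bar{Z})^2. \tag{42}$$

By (41) and (42), it suffices to derive the joint distribution of $\sqrt{n}\,\bar{Z}$
and $\sum_{j=1}^{n} (Z_j - \bar{Z})^2$, where $Z_1, \ldots, Z_n$ are i.i.d. from N(0, 1).

Let

$$P = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{pmatrix}$$

be an n × n orthogonal matrix where

$$p_1 = \left( \frac{1}{\sqrt{n}}, \ldots, \frac{1}{\sqrt{n}} \right)$$

and the remaining rows are obtained by, say, applying Gram-Schmidt
to $\{p_1, e_2, e_3, \ldots, e_n\}$, where $e_j$ is a standard unit vector in
the j-th direction in $\mathbb{R}^n$. Let

$$Y = PZ = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$

Then

$$Y_1 = p_1 Z = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} Z_j = \sqrt{n}\, \bar{Z}.$$

Since P is orthogonal, it preserves vector lengths:

$$\sum_{j=1}^{n} Y_j^2 = \sum_{j=1}^{n} Z_j^2. \tag{43}$$

Hence

$$\sum_{j=2}^{n} Y_j^2 = \sum_{j=1}^{n} Y_j^2 - Y_1^2 = \sum_{j=1}^{n} Z_j^2 - n \bar{Z}^2 = \sum_{j=1}^{n} (Z_j - \bar{Z})^2. \tag{44}$$

Since the $Y_j$'s are mutually independent (by orthogonality of P), we
can conclude that

$$\sum_{j=2}^{n} Y_j^2 = \sum_{j=1}^{n} (Z_j - \bar{Z})^2$$

is independent of

$$Y_1 = \sqrt{n}\, \bar{Z}.$$

Also by orthogonality of P, $Y_j \sim N(0, 1)$ for j = 1, ..., n, so

$$\sum_{j=2}^{n} Y_j^2 \sim \chi^2(n-1) \quad \text{(see Appendix A.6)}$$

and hence, by (42) and (44),

$$\frac{(n-1)\, s^2}{\sigma^2} \sim \chi^2(n-1). \tag{45}$$

Since $E(\chi^2_k) = k$ for $\chi^2_k \sim \chi^2(k)$, we can see that

$$E\left( \frac{(n-1)\, s^2}{\sigma^2} \right) = n - 1.$$

Also, since

$$E\left( \frac{(n-1)\, s^2}{\sigma^2} \right) = \frac{n-1}{\sigma^2}\, E(s^2),$$

we can conclude that

$$E(s^2) = \sigma^2,$$

i.e., s² is an unbiased estimator of the variance σ².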
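
The unbiasedness of s² can be illustrated by averaging the sample variance over many simulated samples. The sketch below is a Monte Carlo check only; the parameter values, trial count, seed, and function names are arbitrary choices.

```python
import random

def sample_variance(xs):
    """Sample variance with the n - 1 denominator."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def mean_sample_variance(mu, sigma, n, trials=20_000, seed=1):
    """Average of s^2 over repeated samples of size n drawn from N(mu, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += sample_variance([rng.gauss(mu, sigma) for _ in range(n)])
    return total / trials

avg_s2 = mean_sample_variance(mu=3.0, sigma=2.0, n=5)   # should be near sigma^2 = 4
```

Replacing the n - 1 denominator by n in `sample_variance` would bias the average downward by the factor (n - 1)/n.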



Various publications have been referenced herein, the contents of
which are hereby incorporated by reference in their entireties. It should be noted that
all procedures and algorithms according to the present invention described herein can
be performed using the exemplary systems of the present invention illustrated in
Figures 1 and 2 and described herein, as well as being programmed as software
arrangements according to the present invention to be executed by such systems or
other exemplary systems and/or processing arrangements.
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WHAT IS CLAIMED IS : 

1. A method for determining an association between a first dataset and a second 
dataset comprising: 

a) obtaining at least one first data corresponding to one or more prior
assumptions regarding said first and second datasets;

b) obtaining at least one second data corresponding to one or more portions of 
actual information regarding said first and second datasets; and 

c) combining the at least one first data and the at least one second data to 
determine the association between the first and second datasets. 

2. The method of Claim 1, wherein one of the one or more prior assumptions is
that the means of the first and second datasets are random variables with a
known a priori distribution.

3. The method of Claim 1, wherein one of the one or more prior assumptions is
that the means of the first and second datasets are normal random variables
with an a priori Gaussian distribution N(μ, τ²), wherein μ, which is a mean, and
τ², which is a variance, may be unknown.

4. The method of Claim 1, wherein one of the one or more prior assumptions is
that the means of the first and second datasets are normal random variables
with an a priori Gaussian distribution N(μ, τ²), wherein μ is known.

5. The method of Claim 1, wherein one of the one or more prior assumptions is
that the means of the first and second datasets are zero-mean normal random
variables with an a priori Gaussian distribution N(μ, τ²), wherein μ=0.

6. The method of Claim 1, wherein one of the one or more portions of the actual
information is an a posteriori distribution of the means of the first and second
datasets obtained directly from the first and second datasets.

7. The method of Claim 1, wherein the association is a correlation. 

8. The method of Claim 1, wherein the association is a dot product. 

9. The method of Claim 1, wherein the association is a Euclidean distance.

10. The method of Claim 7, wherein the determination of the correlation
comprises a use of James-Stein Shrinkage estimators in conjunction with the
first and second data.

11. The method of Claim 10, wherein the determination of the correlation utilizes
a correlation coefficient that is modified by an optimal shrinkage parameter γ.

12. The method of Claim 11, wherein determination of the optimal shrinkage
parameter γ comprises the use of Bayesian considerations in conjunction with
the first and second data.

13. The method of Claim 11, wherein the shrinkage parameter γ is estimated from
the datasets using cross-validation.

14. The method of Claim 11, wherein the shrinkage parameter γ is estimated by
simulation.

15. The method of Claim 11, wherein the correlation coefficient includes a
plurality of correlation coefficients parameterized by 0 ≤ γ ≤ 1 and may be
defined, for datasets X_j and X_k as:

$$S(X_j, X_k) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_{ij} - (X_j)_{\text{offset}}}{\Phi_j} \right) \left( \frac{X_{ik} - (X_k)_{\text{offset}}}{\Phi_k} \right)$$

wherein

$$\Phi_j = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( X_{ij} - (X_j)_{\text{offset}} \right)^2 }.$$

16. The method of Claim 15, wherein

$$\gamma = 1 - \frac{M-2}{MN(N-1)} \cdot \frac{\sum_{j=1}^{M} \sum_{i=1}^{N} \left( X_{ij} - \bar{X}_j \right)^2}{\sum_{j=1}^{M} \bar{X}_j^{\,2}},$$

wherein M represents, in an M x N matrix, a number of rows corresponding to
datapoints from the first dataset, and N represents a number of columns
corresponding to datapoints from the second dataset.
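
For illustration only, the γ-parameterized coefficient of Claim 15 and the estimate of γ in Claim 16 might be sketched as follows. The sketch assumes that the offset for a dataset is γ times its sample mean, which is consistent with γ = 0 giving the uncentered (Eisen) metric and γ = 1 giving the Pearson correlation; it also assumes the reading of the γ formula in which the within-row sum of squares is divided by the sum of squared row means. The function names and the toy data are hypothetical.

```python
import math

def shrinkage_similarity(xj, xk, gamma):
    """Correlation-like similarity with offsets shrunk by gamma.

    Assumption: offset = gamma * sample mean, so gamma = 1 is Pearson
    and gamma = 0 is the uncentered coefficient.
    """
    n = len(xj)
    off_j = gamma * sum(xj) / n
    off_k = gamma * sum(xk) / n
    phi_j = math.sqrt(sum((x - off_j) ** 2 for x in xj) / n)
    phi_k = math.sqrt(sum((x - off_k) ** 2 for x in xk) / n)
    cross = sum((a - off_j) * (b - off_k) for a, b in zip(xj, xk))
    return cross / (n * phi_j * phi_k)

def estimate_gamma(rows):
    """Estimate gamma from an M x N table given as M rows of N values (assumed reading of Claim 16)."""
    m, n = len(rows), len(rows[0])
    means = [sum(r) / n for r in rows]
    within = sum((x - mu) ** 2 for r, mu in zip(rows, means) for x in r)
    between = sum(mu ** 2 for mu in means)
    return 1.0 - (m - 2) / (m * n * (n - 1)) * within / between

a, b = [1.0, 2.0, 3.0], [3.0, 1.0, 2.0]
pearson_like = shrinkage_similarity(a, b, gamma=1.0)
uncentered = shrinkage_similarity(a, b, gamma=0.0)
gamma_hat = estimate_gamma([[1.0, 3.0], [2.0, 2.0], [4.0, 0.0]])
```

Intermediate values of γ interpolate between the two classical coefficients by moving the offset between zero and the sample mean.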

17. The method of Claim 16, wherein M is the number of rows corresponding to 
all genes from which expression data has been collected in one or more 
microarray experiments. 

18. The method of Claim 16, wherein M is representative of a genotype and N is 
representative of a phenotype. 

19. The method of Claim 18, wherein the correlation is a genotype/phenotype 
correlation. 

20. The method of Claim 19, wherein the genotype/phenotype correlation is 
indicative of a causal relationship between a genotype and a phenotype. 

21. The method of Claim 20, wherein the phenotype is that of a complex genetic 
disorder. 

22. The method of Claim 21, wherein the complex genetic disorder includes at 
least one of a cancer, a neurological disease, a developmental disorder, a 
neurodevelopmental disorder, a cardiovascular disease, a metabolic disease, an 
immunologic disorder, an infectious disease, and an endocrine disorder. 

23. The method of Claim 7 wherein the correlation is provided between financial 
information for one or more financial instruments traded on a financial 
exchange. 

24. The method of Claim 7 wherein the correlation is provided between user 
profiles for one or more users in an e-commerce application. 

25. A software arrangement which, when executed on a processing device,
configures the processing device to determine an association between a first
dataset and a second dataset, the software arrangement comprising a
processing subsystem which, when executed on the processing device,
configures the processing device to perform the following steps:

a) obtaining at least one first data corresponding to one or more
prior assumptions regarding said first and second datasets;

b) obtaining at least one second data corresponding to one or more
portions of actual information regarding said first and second
datasets; and

c) combining the at least one first data and the at least one second
data to determine the association between the first and second
datasets.

26. The software arrangement of Claim 25, wherein one of the one or more prior
assumptions is that the means of the first and second datasets are random
variables with a known a priori distribution.

27. The software arrangement of Claim 25, wherein one of the one or more prior
assumptions is that the means of the first and second datasets are normal
random variables with an a priori Gaussian distribution N(μ, τ²), wherein μ,
which is a mean, and τ², which is a variance, may be unknown.

28. The software arrangement of Claim 25, wherein one of the one or more prior
assumptions is that the means of the first and second datasets are normal
random variables with an a priori Gaussian distribution N(μ, τ²), wherein μ is
known.

29. The software arrangement of Claim 25, wherein one of the one or more prior
assumptions is that the means of the first and second datasets are zero-mean
normal random variables with an a priori Gaussian distribution N(μ, τ²),
wherein μ=0.

30. The software arrangement of Claim 25, wherein one of the one or more
portions of the actual information is an a posteriori distribution of the means
of the first and second datasets obtained directly from the first and second
datasets.

31. The software arrangement of Claim 25, wherein the association is a
correlation.
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32. The software arrangement of Claim 25, wherein the association is a dot 
product. 

33. The software arrangement of Claim 25, wherein the association is a Euclidean
distance.

34. The software arrangement of Claim 31, wherein the determination of the
correlation comprises a use of James-Stein Shrinkage estimators in
conjunction with the first and second data.

35. The software arrangement of Claim 34, wherein the determination of the
correlation utilizes a correlation coefficient that is modified by an optimal
shrinkage parameter γ.

36. The software arrangement of Claim 35, wherein determination of the optimal
shrinkage parameter γ comprises the use of Bayesian considerations in
conjunction with the first and second data.

37. The software arrangement of Claim 35, wherein the shrinkage parameter γ is
estimated from the datasets using cross-validation.

38. The software arrangement of Claim 35, wherein the shrinkage parameter γ is
estimated by simulation.

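39. The software arrangement of Claim 35, wherein the correlation coefficient
includes a plurality of correlation coefficients parameterized by 0 ≤ γ ≤ 1 and
may be defined, for datasets X_j and X_k as:

$$S(X_j, X_k) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_{ij} - (X_j)_{\text{offset}}}{\Phi_j} \right) \left( \frac{X_{ik} - (X_k)_{\text{offset}}}{\Phi_k} \right)$$

wherein

$$\Phi_j = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( X_{ij} - (X_j)_{\text{offset}} \right)^2 }.$$

40. The software arrangement of Claim 39, wherein

$$\gamma = 1 - \frac{M-2}{MN(N-1)} \cdot \frac{\sum_{j=1}^{M} \sum_{i=1}^{N} \left( X_{ij} - \bar{X}_j \right)^2}{\sum_{j=1}^{M} \bar{X}_j^{\,2}},$$

where M represents, in an M x N matrix, a number of rows corresponding to
datapoints from the first dataset, and N represents a number of columns
corresponding to datapoints from the second dataset.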

41. The software arrangement of Claim 40, wherein M is the number of rows
corresponding to all genes from which expression data has been collected in
one or more microarray experiments.

42. The software arrangement of Claim 40, wherein M is representative of a
genotype and N is representative of a phenotype.

43. The software arrangement of Claim 42, wherein the correlation is a
genotype/phenotype correlation.

44. The software arrangement of Claim 43, wherein the genotype/phenotype
correlation is indicative of a causal relationship between a genotype and a
phenotype.

45. The software arrangement of Claim 44, wherein the phenotype is that of a
complex genetic disorder.

46. The software arrangement of Claim 45, wherein the complex genetic disorder
includes at least one of a cancer, a neurological disease, a developmental
disorder, a neurodevelopmental disorder, a cardiovascular disease, a metabolic
disease, an immunologic disorder, an infectious disease, and an endocrine
disorder.

47. The software arrangement of Claim 31 wherein the correlation is provided
between financial information for one or more financial instruments traded on
a financial exchange.

48. The software arrangement of Claim 31 wherein the correlation is provided
between user profiles for one or more users in an e-commerce application.

49. A storage medium which includes thereon a software arrangement for
determining an association between a first dataset and a second dataset, the
software arrangement comprising a processing subsystem which, when
executed on the processing device, configures the processing device to
perform the following steps:

a) obtaining at least one first data corresponding to one or more
prior assumptions regarding said first and second datasets;

b) obtaining at least one second data corresponding to one or more
portions of actual information regarding said first and second
datasets; and

c) combining the at least one first data and the at least one second
data to determine the association between the first and second
datasets.

50. The storage medium of Claim 49, wherein one of the one or more prior
assumptions is that the means of the first and second datasets are random
variables with a known a priori distribution.

51. The storage medium of Claim 49, wherein one of the one or more prior
assumptions is that the means of the first and second datasets are normal
random variables with an a priori Gaussian distribution N(μ, τ²), wherein μ,
which is a mean, and τ², which is a variance, may be unknown.

52. The storage medium of Claim 49, wherein one of the one or more prior
assumptions is that the means of the first and second datasets are normal
random variables with an a priori Gaussian distribution N(μ, τ²), wherein
parameter μ is known.

53. The storage medium of Claim 49, wherein one of the one or more prior
assumptions is that the means of the first and second datasets are zero-mean
normal random variables with an a priori Gaussian distribution N(μ, τ²),
wherein μ=0.

54. The storage medium of Claim 49, wherein one of the one or more portions of
the actual information is an a posteriori distribution of the means of the first
and second datasets obtained directly from the first and second datasets.

55. The storage medium of Claim 49, wherein the association is a correlation.

56. The storage medium of Claim 49, wherein the association is a dot product.

57. The storage medium of Claim 49, wherein the association is a Euclidean
distance.

58. The storage medium of Claim 55, wherein the determination of the correlation
comprises a use of James-Stein Shrinkage estimators in conjunction with the
first and second data.

59. The storage medium of Claim 58, wherein the determination of the correlation
utilizes a correlation coefficient that is modified by an optimal shrinkage
parameter γ.

60. The storage medium of Claim 59, wherein determination of the optimal
shrinkage parameter γ comprises the use of Bayesian considerations in
conjunction with the first and second data.

61. The storage medium of Claim 59, wherein the shrinkage parameter γ is
estimated from the datasets using cross-validation.

62. The storage medium of Claim 59, wherein the shrinkage parameter γ is
estimated by simulation.

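63. The storage medium of Claim 59, wherein the correlation coefficient includes
a plurality of correlation coefficients parameterized by 0 ≤ γ ≤ 1 and may be
defined, for datasets X_j and X_k as:

$$S(X_j, X_k) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_{ij} - (X_j)_{\text{offset}}}{\Phi_j} \right) \left( \frac{X_{ik} - (X_k)_{\text{offset}}}{\Phi_k} \right)$$

wherein

$$\Phi_j = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( X_{ij} - (X_j)_{\text{offset}} \right)^2 }.$$

64. The storage medium of Claim 63, wherein

$$\gamma = 1 - \frac{M-2}{MN(N-1)} \cdot \frac{\sum_{j=1}^{M} \sum_{i=1}^{N} \left( X_{ij} - \bar{X}_j \right)^2}{\sum_{j=1}^{M} \bar{X}_j^{\,2}},$$

where M represents, in an M x N matrix, a number of rows corresponding to
datapoints from the first dataset, and N represents a number of columns
corresponding to datapoints from the second dataset.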

65. The storage medium of Claim 64, wherein M is the number of rows 
corresponding to all genes from which expression data has been collected in 
one or more microarray experiments. 

66. The storage medium of Claim 64, wherein M is representative of a genotype
and N is representative of a phenotype.

67. The storage medium of Claim 66, wherein the correlation is a
genotype/phenotype correlation.

68. The storage medium of Claim 67, wherein the genotype/phenotype correlation
is indicative of a causal relationship between a genotype and a phenotype.

69. The storage medium of Claim 68, wherein the phenotype is that of a complex
genetic disorder.

70. The storage medium of Claim 69, wherein the complex genetic disorder
includes at least one of a cancer, a neurological disease, a developmental
disorder, a neurodevelopmental disorder, a cardiovascular disease, a metabolic
disease, an immunologic disorder, an infectious disease, and an endocrine
disorder.

71. The storage medium of Claim 55, wherein the correlation is provided between
financial information for one or more financial instruments traded on a
financial exchange.

72. The storage medium of Claim 55, wherein the correlation is provided between
user profiles for one or more users in an e-commerce application.

73. A system for determining an association between a first dataset and a second
dataset comprising:

a) obtaining at least one first data corresponding to one or more
prior assumptions regarding said first and second datasets;

b) obtaining at least one second data corresponding to one or more
portions of actual information regarding said first and second
datasets; and

c) combining the at least one first data and the at least one second
data to determine the association between the first and second
datasets.

74. The system of Claim 73, wherein one of the one or more prior assumptions is
that the means of the first and second datasets are random variables with a
known a priori distribution.

75. The system of Claim 73, wherein one of the one or more prior assumptions is
that the means of the first and second datasets are normal random variables
with an a priori Gaussian distribution N(μ, τ²), wherein μ, which is a mean, and
τ², which is a variance, may be unknown.

76. The system of Claim 73, wherein one of the one or more prior assumptions is
that the means of the first and second datasets are normal random variables
with an a priori Gaussian distribution N(μ, τ²), wherein μ is known.

77. The system of Claim 73, wherein one of the one or more prior assumptions is
that the means of the first and second datasets are zero-mean normal random
variables with an a priori Gaussian distribution N(μ, τ²), wherein μ=0.
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78. The system of Claim 73, wherein one of the one or more portions of the actual 
information is an a posteriori distribution of the means of the first and second 
datasets obtained directly from the first and second datasets. 

79. The system of Claim 73, wherein the association is a correlation.

80. The system of Claim 73, wherein the association is a dot product.

81. The system of Claim 73, wherein the association is a Euclidean distance.

82. The system of Claim 79, wherein the determination of the correlation
comprises a use of James-Stein Shrinkage estimators in conjunction with the
first and second data.

83. The system of Claim 82, wherein the determination of the correlation utilizes a
correlation coefficient that is modified by an optimal shrinkage parameter γ.

84. The system of Claim 83, wherein determination of the optimal shrinkage
parameter γ comprises the use of Bayesian considerations in conjunction with
the first and second data.

85. The system of Claim 83, wherein the shrinkage parameter γ is estimated from
the datasets using cross-validation.

86. The system of Claim 83, wherein the shrinkage parameter γ is estimated by
simulation.

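87. The system of Claim 83, wherein the correlation coefficient includes a
plurality of correlation coefficients parameterized by 0 ≤ γ ≤ 1 and may be
defined, for datasets X_j and X_k as:

$$S(X_j, X_k) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_{ij} - (X_j)_{\text{offset}}}{\Phi_j} \right) \left( \frac{X_{ik} - (X_k)_{\text{offset}}}{\Phi_k} \right)$$

wherein

$$\Phi_j = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( X_{ij} - (X_j)_{\text{offset}} \right)^2 }.$$

88. The system of Claim 87, wherein

$$\gamma = 1 - \frac{M-2}{MN(N-1)} \cdot \frac{\sum_{j=1}^{M} \sum_{i=1}^{N} \left( X_{ij} - \bar{X}_j \right)^2}{\sum_{j=1}^{M} \bar{X}_j^{\,2}},$$

where M represents, in an M x N matrix, a number of rows corresponding to
datapoints from the first dataset, and N represents a number of columns corresponding
to datapoints from the second dataset.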

89. The system of Claim 88, wherein M is the number of rows corresponding to 
all genes from which expression data has been collected in one or more 
microarray experiments. 

90. The system of Claim 88, wherein M is representative of a genotype and N is 
representative of a phenotype. 

91. The system of Claim 90, wherein the correlation is a genotype/phenotype 
correlation. 

92. The system of Claim 91, wherein the genotype/phenotype correlation is 
indicative of a causal relationship between a genotype and a phenotype. 

93. The system of Claim 92, wherein the phenotype is that of a complex genetic
disorder.

94. The system of Claim 93, wherein the complex genetic disorder includes at
least one of a cancer, a neurological disease, a developmental disorder, a
neurodevelopmental disorder, a cardiovascular disease, a metabolic disease, an
immunologic disorder, an infectious disease, and an endocrine disorder.

95. The system of Claim 79, wherein the correlation is provided between financial
information for one or more financial instruments traded on a financial
exchange.

96. The system of Claim 79, wherein the correlation is provided between user
profiles for one or more users in an e-commerce application.
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